ABSTRACT

Since 1951, when Robbins and Monro's pioneering paper on stochastic approximation was published, many articles have appeared dealing with extensions, modifications, methods, and applications of stochastic approximation. Although the concepts involved are relatively simple, the mathematics is difficult, and the information concerning specific results has been widely scattered and difficult for the interested researcher to collect. This paper discusses the major results and provides the references needed to direct the user to more specific findings.
I.   INTRODUCTION

In many areas of analysis, such as bioassay, sensitivity testing, or learning, we are concerned with a level of output, $Y$, given a certain level of some input, $x$. For each given level of $x$, the resultant output is not deterministic but has some underlying probability distribution, $F(Y|x)$. It is then common to refer to the response function of $x$, denoted $M(x)$, as the expected value of $Y$ given $x$, i.e.,

$$M(x) = \int_{-\infty}^{\infty} Y \, dF(Y|x) = E[Y|x].$$

In the usual analysis of the response function, $M(x)$, it is assumed that the function is of known form with unknown parameters, say

$$M(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots$$

where the parameters, $\beta_i$, are estimated on the basis of observations $Y_1, Y_2, \ldots, Y_n$ corresponding to observed values $x_1, x_2, \ldots, x_n$. The method of least squares, for example, yields the estimators of the $\beta_i$ which minimize the sum of the squared errors.

However, cases often arise in which one has little prior knowledge of the actual form of $M(\cdot)$, or one is only interested in estimating the value $\theta$ such that

$$M(\theta) = \alpha$$

where $\alpha$ is a specific desired response level. We desire to find a sampling scheme such that $x_n \to \theta$.

Robbins  and  Monro  [Ref.  88]  presented  the  following: 


THEOREM: Let $M(x)$ be a given function and $\alpha$ a given constant such that the equation $M(x) = \alpha$ has a uniquely defined root* $x = \theta$. Let $Y(x)$ denote a realization of an experiment at "control level" $x$. Assume $Y(x)$ has distribution

$$P(Y(x) < Y) = H(Y|x) \quad\text{such that}\quad M(x) = \int_{-\infty}^{\infty} Y \, dH(Y|x)$$

(i.e., $M(x) = E(Y|x)$). Choose $x_1$ arbitrary and define the recursive relation

$$x_{n+1} = x_n + a_n \{\alpha - Y(x_n)\}.$$

If there exists a positive constant $C$ such that

$$P[\,|Y(x)| < C\,] = 1$$

and if, for some $\delta > 0$,

$$M(x) \le \alpha - \delta \quad\text{for } x < \theta$$

and

$$M(x) \ge \alpha + \delta \quad\text{for } x > \theta,$$

then for $a_n = 1/n$

$$\lim_{n\to\infty} E[(x_n - \theta)^2] = 0.$$

* Note that this requires that for some $\delta > 0$, $M(x) \le \alpha - \delta$ for $x < \theta$ and $M(x) \ge \alpha + \delta$ for $x > \theta$, but does not specifically require that $M(\theta) = \alpha$.

The procedure of recursively defining $x_{n+1}$ as a function of $x_n$ by

$$x_{n+1} = x_n + a_n (\alpha - Y(x_n))$$

is referred to as the Robbins-Monro method or procedure. (Note that the process is a first-order Markov process, although it is in general non-homogeneous.) Papers which followed Robbins and Monro's discussed topics such as convergence, finding the maximum (or minimum) of a function, multidimensional applications, and accelerated processes, to name a few.
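As a rough illustration of the recursion above, the following Python sketch iterates $x_{n+1} = x_n + a_n(\alpha - Y(x_n))$ with $a_n = 1/n$. The function name `robbins_monro`, the callable `observe` standing in for one experiment $Y(x)$, and the noisy linear response used in the demonstration are illustrative assumptions and do not come from the papers surveyed here.

```python
import random


def robbins_monro(observe, x1, alpha, n_steps):
    """Iterate x_{n+1} = x_n + a_n * (alpha - Y(x_n)) with a_n = 1/n.

    `observe(x)` is a hypothetical stand-in for one noisy experiment Y(x).
    """
    x = x1
    for n in range(1, n_steps + 1):
        a_n = 1.0 / n
        x = x + a_n * (alpha - observe(x))
    return x


if __name__ == "__main__":
    rng = random.Random(0)
    # Assumed response for the demo: M(x) = 2x + 1 plus bounded uniform
    # noise, so that theta = 1 when alpha = 3.
    noisy = lambda x: 2.0 * x + 1.0 + rng.uniform(-0.5, 0.5)
    print(robbins_monro(noisy, x1=0.0, alpha=3.0, n_steps=5000))
```

With a noise-free response $Y(x) = x$ and $\alpha = 3$, the first step (with $a_1 = 1$) already lands on the root and every later step leaves it unchanged, which gives a quick sanity check of the update rule.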

In the first few years of stochastic approximation, survey papers by Derman (1956) [Ref. 25], Schmetterer (1960) [Ref. 101], and Loginov (1966) [Ref. 81] presented major results through their respective dates of publication. A text on the subject was attempted by Wasan (1969) [Ref. 129] but received strong criticism because of serious oversights and many misprints.¹ While the aforementioned publications contained only the mathematical formulation, other treatments by Fu [Refs. 53, 54] and Wetherill [Ref. 130] contained predominantly practical and intuitive information and little mathematical background. This paper will attempt to present the major results of both mathematical formulation and practical applications and to discuss the intuitive meaning where it is applicable. The list of references is intended to be as complete as possible on the subject. Consequently many of the bibliographical entries are not specifically referenced.

¹ Dupac, V., Book Review, Annals of Mathematical Statistics, v. 41, p. 1131, 1970.


II.   MOTIVATING  STOCHASTIC  APPROXIMATION 

In certain applications, such as bioassay, sensitivity testing, or fatigue trials, the statistician is often interested in estimating a given quantile of a distribution function or a level of response. Situations of this type are candidates for solution by stochastic approximation methods. Examples of these situations are:

A.   QUANTILE ESTIMATION

Suppose we are testing the resistance of a metallic component to fatigue fracture. Let $F(x)$ denote the probability that a specimen will fail if subjected to $x$ cycles in a trial. Then a specimen, when tested in such a way, represents an observation which takes on the value one or zero depending on whether or not it fractures in $x$ cycles. Thus, in the notation of the previous section, $Y(x) = 1$ if the specimen fractures and $Y(x) = 0$ otherwise, so that $M(x) = E[Y(x)] = F(x)$. It is of interest to estimate the number of cycles, $x$, such that for a given $\alpha$, $F(x) = M(x) = \alpha$.

1.   LD50

We wish to administer sample doses of a drug to laboratory animals, say rats, in order to determine the dosage at which 50% of them die on the average. In this case $\alpha = .5$ in our problem formulation and we desire to solve $M(x) = .5$ for $x$.
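The quantile problem can be sketched in Python as follows. Everything specific here is an assumption made for illustration: the logistic dose-response curve, its median and scale, the gain constant `c`, and the helper names `estimate_quantile` and `make_logistic_subject` do not appear in the source. Each trial returns a zero-one response, and the dose is moved by $(c/n)(\alpha - Y)$.

```python
import math
import random


def estimate_quantile(respond, x1, alpha, n_steps, c):
    """Robbins-Monro search for the dose x with P(response at x) = alpha.

    `respond(x)` returns True/False for one trial at dose x (hypothetical).
    """
    x = x1
    for n in range(1, n_steps + 1):
        y = 1.0 if respond(x) else 0.0
        x += (c / n) * (alpha - y)
    return x


def make_logistic_subject(median, scale, rng):
    """Assumed dose-response model: F(x) = 1 / (1 + exp(-(x - median)/scale))."""
    def respond(dose):
        p = 1.0 / (1.0 + math.exp(-(dose - median) / scale))
        return rng.random() < p
    return respond


if __name__ == "__main__":
    rng = random.Random(1)
    subject = make_logistic_subject(median=2.0, scale=1.0, rng=rng)
    # alpha = 0.5 targets the median dose (the LD50 of the assumed model).
    print(estimate_quantile(subject, x1=0.0, alpha=0.5, n_steps=5000, c=4.0))
```

The gain `c = 4.0` is chosen so that `c` times the slope of the assumed curve at its median (0.25) equals 1; other choices would also converge but at different rates.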
B.  LEVEL DETERMINATION

Suppose in production it is desired to find the level of some material such that a characteristic, say the viscosity of the finished product, is at a pre-determined level. However, each batch is subject to impurities and reacts as a stochastic realization. A stochastic approximation scheme may be devised to automatically set and correct the input flow to produce the desired results.

C.  ROUND-OFF ERRORS

As stated by Schmetterer [Ref. 101], we can consider an application of the RM process to the problem of round-off errors. This problem occurs, for example, if one solves equations by a classical iteration process using electronic computers. Define for every real number $x$ a random variable, $Y(x)$, in the following way:

$$P[Y(x) = [x]] = 1 - (x - [x]),$$
$$P[Y(x) = [x] + 1] = x - [x].^{*}$$

Note that $E[Y(x)] = x$. From here we can deduce, as a pattern for more general theorems, the following result. If one solves a linear equation by an iterative procedure and modifies it by using for every step of the iteration the round-off rule given above, then the modified procedure converges with probability one to a solution of the given equation.

* Note that $[x]$ denotes the largest integer contained in $x$. For example $[2.8?] = 2$.
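The random round-off rule is easy to state in code. The sketch below is illustrative only: the function names are hypothetical, and `expected_rounded_value` simply evaluates the mean $[x](1 - (x - [x])) + ([x] + 1)(x - [x])$ to confirm that it equals $x$.

```python
import math
import random


def stochastic_round(x, rng):
    """Round x down to [x] with probability 1 - (x - [x]), else up to [x] + 1."""
    floor_x = math.floor(x)
    frac = x - floor_x
    return floor_x + 1 if rng.random() < frac else floor_x


def expected_rounded_value(x):
    """Exact mean of the rule: [x] * (1 - frac) + ([x] + 1) * frac."""
    floor_x = math.floor(x)
    frac = x - floor_x
    return floor_x * (1.0 - frac) + (floor_x + 1) * frac
```

Because the rounded value is unbiased for $x$, the rounding error plays the role of the zero-mean noise term in the stochastic approximation setting.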
III.   THE ROBBINS-MONRO METHOD

In the introduction the first of two theorems from the original Robbins-Monro paper was presented. This first theorem required that the response, $Y(x)$, be bounded, allowed discontinuities in the function $M(x) = E[Y(x)]$, and did not specifically require that $M(\theta) = \alpha$, the desired response level. The second theorem of Robbins and Monro is presented below:

THEOREM  [Ref. 88]

Let the sequence $a_n$ be of the form $1/n$ and assume there exists some constant $C > 0$ such that

$$P\{|Y(x)| < C\} = 1,$$

and that the conditions

(i)    $M(x)$ is a nondecreasing function,
(ii)   $M(\theta) = \alpha$,
(iii)  $M'(\theta) > 0$

are satisfied. Then defining the recursive relation

$$x_{n+1} = x_n + a_n [\alpha - Y(x_n)]$$

implies the result

$$\lim_{n\to\infty} E[(x_n - \theta)^2] = 0.$$


A.   WEAKENING THE CONDITIONS FOR CONVERGENCE

Wolfowitz [Ref. 132], in response to questions of Robbins and Monro, showed the following. Suppose the response function, $M(x)$, and the distribution of $Y(x)$ satisfy

$$|M(x)| < C,$$

$$\int (Y - M(x))^2 \, dF(Y|x) < \sigma^2 < +\infty,$$

along with

$$M(x) < \alpha \quad\text{for } x < \theta,$$

$$M(\theta) = \alpha,$$

$$M(x) > \alpha \quad\text{for } x > \theta,$$

$M(x)$ strictly increasing when $|x - \theta| < \delta$ for some $\delta > 0$, and

$$\inf_{|x - \theta| \ge \delta} |M(x) - \alpha| > 0.$$

Then, for $x_n$ defined as in the RM process, $x_n$ converges in probability to $\theta$.


B.   CONVERGENCE WITH PROBABILITY 1

Kallianpur [Ref. 68] and Blum [Ref. 6] both proved convergence with probability 1, a mode of convergence stronger than the convergence in probability implied by convergence in mean square. Blum [Ref. 6] proved that the RM process converges with probability 1 under conditions even weaker than those of Wolfowitz [Ref. 132]. While Wolfowitz required that the regression function $M(\cdot)$ be bounded, Blum only required that it lie between two lines.

BLUM'S THEOREM  [Ref. 6]

Let $M(x)$ be the regression function. We assume that $M(x)$ is measurable and satisfies the following conditions:

$$|M(x)| < C + d|x| \quad\text{for some } C, d > 0,$$

$$\int (Y - M(x))^2 \, dF(Y|x) < \sigma^2 < +\infty,$$

$$M(x) < \alpha \quad\text{for } x < \theta,$$

and

$$M(x) > \alpha \quad\text{for } x > \theta,$$

$$\inf_{\delta_1 \le |x - \theta| \le \delta_2} |M(x) - \alpha| > 0$$

for any pair $(\delta_1, \delta_2)$. If, moreover,

$$\sum_{n=1}^{\infty} a_n = +\infty$$

and

$$\sum_{n=1}^{\infty} a_n^2 < +\infty,$$

then $x_n$ converges to $\theta$ with probability 1, i.e.,

$$P\{\lim_{n\to\infty} x_n = \theta\} = 1.$$

C.   A FURTHER WEAKENING OF CONDITIONS FOR CONVERGENCE

In 1963 Friedman [Ref. 51] further weakened the requirements for convergence with probability 1 by removing the necessity for $M(x)$ and $\sigma^2(x)$ to be bounded by a linear function and a constant respectively. Friedman's theorem is presented here:

THEOREM   [Ref. 51]

Let $f(x)$ be a function which is positive and bounded in any finite interval. Let the following conditions be satisfied:

$$|M(x)| < (L|x| + K) f(x) \quad\text{for constants } L, K > 0,$$

$$\sigma^2(x) < \sigma^2 f(x),$$

$$M(x) < \alpha \quad\text{for } x < \theta,$$

and

$$M(x) > \alpha \quad\text{for } x > \theta,$$

$$\inf_{\delta_1 \le |x - \theta| \le \delta_2} |M(x) - \alpha| > 0$$

for any pair $(\delta_1, \delta_2)$. Then the sequence defined by

$$x_{n+1} = x_n + a_n [\alpha - Y(x_n)] / f(x_n)$$

converges to $\theta$ with probability 1.

This theorem of Friedman enables one to construct a convergent process when $|M(x)|$ and $\sigma(x)$ are bounded by known functions $f_1(x)$ and $f_2(x)$. One then takes $f(x) = \max[f_1(x), (f_2(x))^2]$. This procedure is also applicable where $f(x)$ is decreasing to zero for large values of $x$. However, the convergence is relatively slow.
Later Gladyshev [Ref. 57] simplified the conditions for convergence with probability 1 with the following:

THEOREM

Let $M(x)$ be a measurable function such that

$$\inf_{\epsilon < |x - \theta| < 1/\epsilon} (x - \theta)\{M(x) - \alpha\} > 0 \quad\text{for all } \epsilon > 0.$$

Moreover, assume there exists a positive number $d$ such that for all $x$ we have

$$E[Y^2(x)] < d(1 + x^2).$$

If $\{a_n\}$ satisfies the conditions previously stated, then $x_n$ converges to $\theta$ with probability 1, where $x_n$ is defined as in the original R-M process.

D.   THE MULTIDIMENSIONAL CASE

Blum [Ref. 7] was the first to generalize the Robbins-Monro process to the multidimensional case. He considered the following problem:

Assume that we are given a family of $N$ random variables with distribution functions

$$F_1(\cdot), \ldots, F_N(\cdot);$$

moreover assume that

$$M_i(x_1, \ldots, x_N) = \int_{-\infty}^{\infty} Y_i \, dF_i \quad\text{for } i = 1, \ldots, N$$

(i.e. $M_i$ is the corresponding regression function in the $i$th dimension).

It is desired to construct a sequence whose limit is the root vector of the system of equations

$$M_i(x_1, \ldots, x_N) = \alpha_i.$$

For simplicity it is assumed that all $\alpha_i = 0$ and that

$$M_i(0) = 0.$$

Let $f(x)$ be a real function that is defined on real $N$-dimensional space and has continuous first and second derivatives. Let $A(x) = (\partial^2 f/\partial x_i \partial x_j)$ denote the matrix of second derivatives and let $D(x) = (\partial f/\partial x_i)$ denote the vector of the first derivatives.

In matrix form the Robbins-Monro process is of the form

$$x_{n+1} = x_n + a_n Y_n$$

where it is assumed as before that

$$\sum_{n=1}^{\infty} a_n = \infty \quad\text{and}\quad \sum_{n=1}^{\infty} a_n^2 < +\infty.$$

Observe the following notation:

Let $U(x) = \langle D(x), M(x) \rangle$ (i.e. the scalar product of $D(x)$ and $M(x)$).

Let $V_a(x) = E\{\langle Y_x, A(x + q a Y_x) Y_x \rangle\}$ where $0 < q < 1$.

THEOREM  [Ref. 7]

If there exists a real function $f(x)$ with continuous derivatives that satisfies the conditions

(i)    $f(x) \ge 0$,

(ii)   $\sup_{\|x\| > \epsilon} U(x) < 0$ for $\epsilon > 0$,

(iii)  $\inf_{\|x\| > \epsilon} |f(x) - f(0)| > 0$ for $\epsilon > 0$,

(iv)   $V_a(x) < V < +\infty$ for all $a$,

then the sequence $\{x_n\}$ as previously defined converges almost surely to zero.

It should also be noted that the multidimensional case is a direct extension of theorems by Derman and Sacks [Ref. 26] and by Gladyshev [Ref. 57], where we think of $x_n$, $Y_n$, $\theta$, $\alpha$, and $M(x)$ as $N$-dimensional vectors and treat multiplication as the vector scalar product. Of interest is the special case in which the regression function is linear, so that the equation to be solved is $Mx = \alpha$, where $M$ is a symmetric matrix. The following modified RM process was proposed by Dupac [Ref. 33] for this special case.

Here $Y_n$ is a random vector whose distribution function is $F(Y|Y_n - \alpha)$, where $Y_n$ is a random vector with distribution function $F(Y|x_n)$. Then the following theorem is presented by Dupac.

THEOREM   [Ref. 33]

Assume that

$$\int \|Y - M(x)\|^2 \, dF(Y|x) \le \sigma^2 < +\infty$$

and

$$K_1 = \min_i \lambda_i^2$$

are satisfied, where the $\lambda_i$ are the characteristic roots of the matrix $M$. Then if $a > 1/2K_1$, the sequence defined by

$$x_{n+1} = x_n - \frac{a}{n} Y_n$$

converges to $\theta$ with probability 1 and

$$E\{\|x_n - \theta\|^2\} = O(1/n).$$

There are certain situations where the multidimensional case can be reduced to a one-dimensional case. The results are by Eppling [Ref. 37] and require general stochastic approximation theorems of the Dvoretzky type.

E.   DVORETZKY'S GENERALIZED PROCESS

Dvoretzky [Ref. 36] has suggested that any stochastic approximation procedure may be viewed as an ordinary deterministic (error-free) successive approximation method with a noise component superimposed on it at each step. On the basis of this concept a very general class of stochastic approximation theorems can be studied.

Assume that $T_n(x_1, \ldots, x_n)$ is a sequence of Borel-measurable transformations from $n$-dimensional Euclidean space, $R^n$, into $R^1$. One may then construct the sequence from the relation

$$x_{n+1} = T_n(x_1, \ldots, x_n) + Z_n$$

where $T_n(x_1, \ldots, x_n)$ is the error-free transformation and $Z_n$ is the error. Dvoretzky then proved the following theorem.

THEOREM   [Ref. 36]

Let $\{\alpha_n\}$, $\{\beta_n\}$ and $\{\gamma_n\}$ be sequences of non-negative real numbers such that

$$\lim_{n\to\infty} \alpha_n = 0,$$

$$\sum_{n=1}^{\infty} \beta_n < \infty,$$

$$\sum_{n=1}^{\infty} \gamma_n = \infty.$$

Moreover, assume that the condition

$$|T_n(r_1, \ldots, r_n) - \theta| \le \max[\alpha_n, (1 + \beta_n)|r_n - \theta| - \gamma_n]$$

is satisfied for all real $r_1, \ldots, r_n$; also that

$$\sum_{n=1}^{\infty} E(Z_n^2) < \infty$$

and

$$E(Z_n | x_1, \ldots, x_n) = 0$$

with probability one.* Then the sequence $\{x_n\}$ defined by

$$x_{n+1} = T_n(x_1, \ldots, x_n) + Z_n$$

converges to the desired quantity, $\theta$, in mean square and with probability 1, i.e.,

$$\lim_{n\to\infty} E[(x_n - \theta)^2] = 0$$

and

$$P[\lim_{n\to\infty} x_n = \theta] = 1.$$

* Note that this condition is satisfied if the $Z_n$ are a sequence of independent errors for which $E(Z_n) = 0$ for all $n$.
It can easily be shown that the Robbins-Monro procedure is a special case of Dvoretzky's generalized procedure. To do this write the normal R-M relation

$$x_{n+1} = x_n + a_n \{\alpha - Y(x_n)\}$$

as

$$x_{n+1} = x_n + a_n [\alpha - M(x_n)] + a_n [M(x_n) - Y(x_n)].$$

Then letting

$$T_n(x_1, \ldots, x_n) = x_n + a_n [\alpha - M(x_n)]$$

and

$$Z_n = a_n [M(x_n) - Y(x_n)]$$

we have the RM procedure in Dvoretzky's format. In a similar manner the Kiefer-Wolfowitz procedure, which will be discussed later, can be written as a special case of Dvoretzky's theorem.

Dvoretzky extended his generalized procedure even further by replacing the sequences $\alpha_n$, $\beta_n$, and $\gamma_n$ by non-negative functions $\alpha_n(r_1, \ldots, r_n)$, $\beta_n(r_1, \ldots, r_n)$, and $\gamma_n(r_1, \ldots, r_n)$ respectively, provided that they satisfy the following conditions:

(i)     the functions $\alpha_n(r_1, \ldots, r_n)$ are uniformly bounded and $\lim_{n\to\infty} \alpha_n(r_1, \ldots, r_n) = 0$ uniformly for all sequences $r_1, \ldots, r_n, \ldots$

(ii)    the functions $\beta_n(r_1, \ldots, r_n)$ are measurable and $\sum_{n=1}^{\infty} \beta_n(r_1, \ldots, r_n)$ is uniformly bounded and uniformly convergent for all sequences $r_1, \ldots, r_n, \ldots$

(iii)   the functions $\gamma_n(r_1, \ldots, r_n)$ satisfy $\sum_{n=1}^{\infty} \gamma_n(r_1, \ldots, r_n) = \infty$ uniformly for all sequences $r_1, \ldots, r_n, \ldots$ for which $\sup_{n=1,2,\ldots} |r_n| < L$, where $L$ is a finite number.
The introduction of Dvoretzky's general conditions allowed stochastic approximation type theorems to be applied to regression functions of the form $M(x) = -x^2$ or $M(x) = \exp(-x^2)$. The most comprehensive presentation of Dvoretzky stochastic approximation theorems has been by Venter [Ref. 120] in 1966. Venter's theorems generalized the work of Dvoretzky [Ref. 36] and Wolfowitz [Ref. 133] for transforms on the real line, of Derman and Sacks [Ref. 26] for finite dimensional Euclidean spaces, and of Schmetterer [Ref. 101] for Hilbert spaces.

Block [Ref. 5] had proposed a more general type of stochastic approximation taking place on a normed vector space.

F.   EXPECTED SQUARED ERROR

While Blum [Ref. 6], Dvoretzky [Ref. 36], and Dupac [Ref. 32] were establishing conditions under which $b_n = E[(x_n - \theta)^2] \to 0$, others such as Chung [Ref. 14], Hodges and Lehmann [Ref. 64], Kallianpur [Ref. 68], and Schmetterer [Ref. 101] were trying to establish bounds on $b_n$. (Note that $b_n = \text{variance} + (\text{bias})^2$.) Below is Schmetterer's result for the bounded case.

THEOREM   [Ref. 101]

Let $M(x)$ be a Borel-measurable function that satisfies

(i)     $P\{|Y(x)| \le C\} = 1$ for some constant $C < +\infty$
and

(ii)    $(x - \theta)\{M(x) - \alpha\} > 0$ for all $x \ne \theta$.

Also there exist an $\epsilon > 0$ and positive constants $C_1$, $C_2$ such that

(iii)  $|M(x) - \alpha| \ge C_1 |x - \theta|$ for $|x - \theta| \le \epsilon$,

(iv)   $|M(x) - \alpha| \ge C_2$ for $|x - \theta| > \epsilon$.
Then

$$b_n \le b_1 \prod_{i=1}^{n-1}\left(1 - \frac{C_3 a_i}{A_{i-1}}\right) + \prod_{i=1}^{n-1}\left(1 - \frac{C_3 a_i}{A_{i-1}}\right) \sum_{i=1}^{n-1} a_i^2 e_i \left[\prod_{r=1}^{i}\left(1 - \frac{C_3 a_r}{A_{r-1}}\right)\right]^{-1}$$

where $e_i = E(y_i - \alpha)^2$, $A_0$ is defined as 1, $A_n = \sum_{i=1}^{n} a_i$, and there exists a constant, $C_3$, such that

$$|M(x_n) - \alpha| \ge \frac{C_3}{A_{n-1}} |x_n - \theta| \quad\text{with probability 1.}$$

As was noted above, this theorem holds for the so-called "bounded case." This same result holds for the quasilinear case if conditions (i), (iii), and (iv) are replaced by the following conditions:

There exists a $C_4$ such that

$$E\{[Y(x) - M(x)]^2\} < C_4$$

and the quasilinear conditions that there exist $C_5$ and $C_6$, $C_5 < C_6$, such that

$$C_6 |x - \theta| \ge |M(x) - \alpha| \ge C_5 |x - \theta|.$$

Then the above estimate of $b_n$ holds when $a_i$ is substituted for $a_i/A_{i-1}$ and $2C_5$ is substituted for $C_3$. Therefore if $a_n = a/n$ where $a > 1/2C_5$ then $b_n = O(1/n)$. This latter is the most frequently used result.
G.   EXPECTED SQUARED ERROR IN THE LINEAR CASE

Hodges and Lehmann [Ref. 64] analyzed in detail the case where it is desired to estimate the value of $x$ for which $M(x) = 0$, where it is assumed that $M(x) = \beta x$ and variance$(Y(x)) = \sigma^2$. Then using the RM process to define $\{x_n\}$ yields

$$b_n = E[x_n^2] = x_1^2 \left[\prod_{r=1}^{n-1} (1 - \beta a_r)\right]^2 + \sigma^2 \sum_{r=1}^{n-1} a_r^2 \prod_{s=r+1}^{n-1} (1 - \beta a_s)^2.$$

In analyzing this expression it becomes obvious that the first term represents the expected bias based on the initial choice, $x_1$, while the second term represents the variance component of the expected squared error. Since Chung [Ref. 14] established that under certain conditions the sequence $a_n = c/n$ gives most rapid convergence of $x_n$ to $\theta$, it is of interest to analyze the expression for expected squared error for this family of coefficients.


For the first term (the expected bias due to the initial value, $x_1$), the expected bias is $O(n^{-\beta c})$ if $(c/n)^{-1} \ge \beta$ for all $n$, but becomes quite large if $(c/n)^{-1} < \beta$. Hence, as noted by Wetherill [Ref. 130], it would be more desirable to tend to overestimate $c = \beta^{-1}$ rather than underestimate.

The analysis of the second term is more complicated, but Hodges and Lehmann have shown that it is asymptotically equivalent to

$$\frac{\sigma^2 c^2}{n(2c\beta - 1)} \quad\text{if } c > 1/2\beta,$$

$$\frac{\sigma^2 \log_e n}{4 n \beta^2} \quad\text{if } c = 1/2\beta,$$

and we note that $c$ should not be less than $(2\beta)^{-1}$ because of the large bias which results.

These results give us conditions on the sequence $a_n = c/n$, in terms of $\beta$, the slope of the regression function, such that the bias resulting from a bad initial guess rapidly tends to zero with increasing sample size and the expected squared error of $x_n$ is of the order $O(1/n)$. The main shortcoming of the linear model is that we do not know how nearly linear $M(x)$ must be, nor how nearly constant the variance of $Y(x)$ must be, in order that the linear approximation will represent what actually happens. The only evidence on this point consists of a sampling experiment by Teichroew [Ref. 109]. There it was found that the linear theory is in reasonable agreement with the data.
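The linear-case formulas can be checked numerically without any simulation. The sketch below rests on an assumption of ours: it propagates $b_{r+1} = (1 - \beta a_r)^2 b_r + a_r^2 \sigma^2$, which follows from squaring $x_{r+1} = (1 - \beta a_r) x_r - a_r \varepsilon_r$ for zero-mean noise $\varepsilon_r$ of variance $\sigma^2$ and taking expectations. The function names are hypothetical.

```python
def exact_mse(x1, beta, sigma2, c, n):
    """b_n = E[x_n^2] for M(x) = beta*x, Var Y(x) = sigma2, steps a_r = c/r."""
    b = x1 ** 2
    for r in range(1, n):
        a_r = c / r
        # One RM step: squared bias shrinks by (1 - beta*a_r)^2, noise adds a_r^2 * sigma2.
        b = (1.0 - beta * a_r) ** 2 * b + a_r ** 2 * sigma2
    return b


def asymptotic_variance_term(beta, sigma2, c, n):
    """sigma^2 c^2 / (n (2 c beta - 1)), the limiting form for c > 1/(2 beta)."""
    if 2.0 * c * beta <= 1.0:
        raise ValueError("requires c > 1/(2*beta)")
    return sigma2 * c ** 2 / (n * (2.0 * c * beta - 1.0))
```

For $\beta = c = \sigma^2 = 1$ the first step wipes out the initial bias and the recursion gives exactly $b_n = 1/(n-1)$, which agrees with the asymptotic value $1/n$ to first order.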

H.   RATE  OF  CONVERGENCE 

In a recent paper Komlos and Revesz [Ref. 134] presented estimates of the rate of convergence of the R-M process in a form more concise than any previous result. They considered the case $\alpha = 0$, $\theta = 0$, $a_n = 1/n$ and presented the following estimates.

For the case where there exists $L > 0$ such that

$$P[\,|Y(x) - M(x)| \le L\,] = 1 \quad\text{and}\quad \lim_{x\to\infty} M(x) = M(\infty) > L,$$

if the conditions

$$M(x) < c_1 x + d_1 \quad\text{if } x > \theta = 0,$$

$$M(x) > c_2 x + d_2 \quad\text{if } x < \theta = 0,$$

are satisfied for positive constants $c_1$, $c_2$, $d_1$, $d_2$, then

$$P[x_n > \epsilon] \le e^{-\gamma n}$$

for any $\epsilon > 0$, where $\gamma = \gamma(\epsilon) > 0$.

For the case where $M(\infty) < L$ the rate of convergence is much slower than for the previous case. Specifically

$$P[x_n > \epsilon] \le \exp\left(-n^{M(\infty)/L - \delta}\right)$$

for any $\delta > 0$, $n > n_0(\delta)$.

I.   ASYMPTOTIC NORMALITY

The asymptotic behavior of the higher-order moments and the asymptotic distribution of the random variables defined by the sequence $\{x_n\}$ were first considered in detail by Chung [Ref. 14]. His method is based on the moments of $(x_n - \theta)$ and his results have been widely used in papers on stochastic approximation. Chung's fundamental result is for the "bounded case" and can be stated as follows:

THEOREM  [Ref. 14]

Let $M(x)$ be a Borel-measurable function that satisfies the following conditions:

$$P\{|Y(x)| \le C_1\} = 1 \quad\text{for some constant } C_1,$$

$$(x - \theta)\{M(x) - \alpha\} > 0,$$

$$M(x) = \alpha + \alpha_1 (x - \theta) + o(|x - \theta|),$$

$$\inf_{|x - \theta| > \delta} |M(x) - \alpha| = K_0(\delta) > 0,$$

and

$$E[(Y(x) - M(x))^2] < \sigma^2 < \infty \quad\text{for all } x.$$

Let $a_n = 1/n^{1-\epsilon}$, where

$$\frac{1}{2(1 + C_2)} < \epsilon < \frac{1}{2},$$

and where $C_2$ is determined by the condition

|M(x^)-a|  >  C^en'^lx^-el 
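Then for any integer $r \ge 1$

$$\lim_{n\to\infty} n^{(1-\epsilon) r/2} E[(x_n - \theta)^r] = \begin{cases} 0 & \text{if } r \text{ is odd}, \\ (\sigma^2/2\alpha_1)^{r/2} (r-1)(r-3)\cdots 3 \cdot 1 & \text{if } r \text{ is even}, \end{cases}$$

and the random variable $n^{(1-\epsilon)/2}(x_n - \theta)$ is asymptotically normally distributed with mean zero and variance $\sigma^2/2\alpha_1$.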

A similar result was obtained by Chung for the quasi-linear case (i.e. $M(x)$ lies between two straight lines with nonvanishing slope). In this case the boundedness assumption is replaced by

$$K|x - \theta| \le |M(x) - \alpha| \le K_1 |x - \theta| \quad\text{for } K > 0,\ K_1 < \infty$$

and

$$E[\{Y(x) - M(x)\}^p] \le K_p < \infty \quad\text{for } p \text{ an even integer.}$$

Then using $a_n = c/n$ where $c > 1/2K$, the distribution of $n^{1/2}(x_n - \theta)$ tends to normal with mean zero and variance $\sigma^2 c^2/(2\alpha_1 c - 1)$.

For the special case where $M(x)$ is linear, Chung proves asymptotic normality by using characteristic functions in a very concise proof. While Hodges and Lehmann [Ref. 64] improved some of Chung's results, Sacks [Ref. 95] utilized a central limit theorem for dependent random variables to obtain more general and more complete results about the asymptotic normality of $x_n$. Below is a theorem of Gladyshev [Ref. 57] that is a strengthened form of Sacks's fundamental result.

THEOREM  [Ref. 57]

Assume that the following conditions are satisfied:

$$\inf_{\epsilon < |x - \theta| < 1/\epsilon} (x - \theta)(M(x) - \alpha) > 0 \quad\text{for } \epsilon > 0,$$

$$M(x) = \alpha + \alpha_1 (x - \theta) + \delta(x, \theta)(x - \theta),$$

there exists a $d > 0$ such that for all $x$,

$$E[Y^2(x)] < d(1 + x^2),$$

$$\lim_{x\to\theta} E[(Y(x) - M(x))^2] = \rho > 0,$$

$$\lim_{N\to\infty} \lim_{\epsilon\to 0} \sup_{|x - \theta| < \epsilon} E[(Y(x) - M(x))^2 \phi_N(x)] = 0,$$

where

$$\phi_N(x) = \begin{cases} 1 & \text{for } |Y(x)| > N, \\ 0 & \text{for } |Y(x)| \le N, \end{cases}$$

and $a_n = A n^{-1}$ is such that $A\alpha_1 > 1/2$.

Then the distribution of $n^{1/2}(x_n - \theta)$ tends to normal with mean zero and variance $A^2 (2A\alpha_1 - 1)^{-1} \rho$.

J.   SELECTION OF STEP SIZE, $a_n$

As we have noted thus far, the sequence $\{a_n\}$ must essentially have the same asymptotic behavior as the harmonic sequence, $1/n$, which satisfies the conditions

$$\sum_{n=1}^{\infty} a_n = \infty \quad\text{and}\quad \sum_{n=1}^{\infty} a_n^2 < \infty.$$

We can intuitively see that the first condition is necessary to guarantee that the sequence $\{x_n\}$ does not get trapped in any finite interval, while the second condition is necessary for the convergence of the expected squared error term. However, it is reasonable to ask if there is a sequence $\{a_n\}$ which minimizes $E[(x_n - \theta)^2]$ after some fixed number of observations, say $N$. Dvoretzky [Ref. 36] solved this problem for the Robbins-Monro process.

THEOREM  [Ref. 36]

Assume that a random variable, $Y(x)$, satisfies the condition

(i)   $E[Y^2(x)] < \sigma^2 < \infty$

and assume that $M(x)$ is such that

(ii)   $0 < A \le \dfrac{M(x) - \alpha}{x - \theta} \le B < \infty$

and if it is known that

(iii)   $|x_1 - \theta| \le c < \left[\dfrac{2\sigma^2}{A(B - A)}\right]^{1/2}$.

Then if the sequence

$$a_n = \frac{A c^2}{\sigma^2 + n A^2 c^2}$$

is used in the Robbins-Monro process, the bound

$$E[(x_n - \theta)^2] \le \frac{\sigma^2 c^2}{\sigma^2 + (n-1) A^2 c^2}$$

is obtained. The choice of $\{a_n\}$ here is optimal in the minimax sense, in that for any other choice of $\{a_n\}$ there exist $Y(x)$ and $x_1$ that satisfy conditions (i) and (iii) for which the above bound on expected squared error does not hold.

Now it is obvious that this information is of limited use to the experimenter who has little a priori information with which to choose $a_n$. Therefore, for practical choice of the sequence $a_n$, the reader is directed to Section V.A, where this problem is discussed.
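The step sequence and the bound in this theorem can be checked against each other numerically. The sketch below makes an assumption that is ours rather than the source's: it takes the extreme case $M(x) - \alpha = A(x - \theta)$ with noise variance exactly $\sigma^2$ and $|x_1 - \theta| = c$, for which the squared error obeys $b_{n+1} = (1 - a_n A)^2 b_n + a_n^2 \sigma^2$. The function names are hypothetical.

```python
def dvoretzky_step(n, A, sigma2, c):
    """Step size a_n = A c^2 / (sigma^2 + n A^2 c^2)."""
    return A * c ** 2 / (sigma2 + n * A ** 2 * c ** 2)


def dvoretzky_bound(n, A, sigma2, c):
    """Bound sigma^2 c^2 / (sigma^2 + (n - 1) A^2 c^2) on E[(x_n - theta)^2]."""
    return sigma2 * c ** 2 / (sigma2 + (n - 1) * A ** 2 * c ** 2)


def worst_case_mse(n, A, sigma2, c):
    """Squared error b_n in the assumed extreme case (slope A, variance sigma2)."""
    b = c ** 2  # b_1 = (x_1 - theta)^2 at its largest allowed value
    for k in range(1, n):
        a_k = dvoretzky_step(k, A, sigma2, c)
        b = (1.0 - a_k * A) ** 2 * b + a_k ** 2 * sigma2
    return b
```

In this extreme case the recursion reproduces the bound exactly; for example, with $A = \sigma^2 = c = 1$ both give $1/n$.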

K.   ACCELERATING  CONVERGEINCE 

When the initial guess, $x_1$, is far from the desired
value of $\theta$, the Robbins-Monro procedure approaches $\theta$ very
slowly because we are taking smaller and smaller steps.
Kesten [Ref. 69] proposed a method of accelerating the
convergence of a stochastic approximation algorithm based
on not decreasing the step size, $a_n$, if the difference
$(x_n - x_{n-1})$ has the same sign as $(x_{n-1} - x_{n-2})$, and
decreasing the step size if the signs differ, indicating that
we may be in the region of $\theta$. (Higher order schemes are also
proposed.)
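
The sign-change rule can be sketched as follows. This is a simplified illustration rather than Kesten's exact indexing: the step is taken as c/k, and the counter k (a name introduced here) is advanced only when two successive differences change sign. The response M(x) = x - theta, the noise model, the seed and all parameter values are assumptions chosen for the example.

```python
import random


def robbins_monro(observe, x1, c, n_steps, kesten=False):
    """Iterate x_{n+1} = x_n - (c / k) * Y(x_n).

    kesten=False: k equals the iteration number n (ordinary R-M steps).
    kesten=True:  k is increased only when x_n - x_{n-1} and
                  x_{n-1} - x_{n-2} have opposite signs.
    """
    x, k, prev_diff = x1, 1, None
    for n in range(1, n_steps + 1):
        if not kesten:
            k = n
        new_x = x - (c / k) * observe(x)
        diff = new_x - x
        if kesten and prev_diff is not None and diff * prev_diff < 0:
            k += 1                      # sign change: shrink the next step
        prev_diff, x = diff, new_x
    return x


if __name__ == "__main__":
    theta = 2.0

    def make_observe(seed):
        rng = random.Random(seed)
        # Y(x) = M(x) + noise with M(x) = x - theta (assumed response)
        return lambda x: (x - theta) + rng.gauss(0.0, 1.0)

    print(robbins_monro(make_observe(0), 50.0, 0.2, 2000))
    print(robbins_monro(make_observe(0), 50.0, 0.2, 2000, kesten=True))
```

With a poor starting value and a small multiplier c, the ordinary sequence is still far from theta after many steps, while the accelerated version keeps its early steps large until sign changes begin to occur.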

It was shown that there exists a $\theta'$, not necessarily
identical with $\theta$, about which fluctuations in sign occur
more frequently in a finite number of trials. The value
$x = \theta'$ is defined by the intersection of the line
$Y(x) = \alpha$ and the locus of medians of the densities
$dF(Y|x)$ for any x. If the density $dF(Y|x)$ is
symmetric, then $\theta' = \theta$. Even if the fluctuations occur
about a $\theta'$ different from $\theta$, $x_n$ still converges in
probability to $\theta$, as Kesten proved.

Authors such as Odell [Ref. 87], Sinha and Griscik
[Ref. 105], Sielken [Ref. 104], and Newbold [Ref. 86] have
presented accelerated stochastic approximation methods of
their own and have compared them with the original R-M
method and Kesten's method.

Another method of accelerating convergence was proposed
by Fabian [Ref. 40]. This method is an analog of the method
of steepest ascent (descent). Fabian proposed that the
step $a_n$ be determined in the following manner: for given $x_n$
and $y_n$ one makes a series of observations, $V_j$ (where the
observations are assumed to be independent of $x_n$ and $y_n$),
of the quantity $M(x_n + j a y_n)$ for j = 1, 2, ... until

$$\operatorname{sign} V_1 = \cdots = \operatorname{sign} V_{j-1} = \operatorname{sign} V_j = -\operatorname{sign} V_{j+1}.$$

Then choose $a_n = ja$. (Note that here $\alpha = 0 = M(\theta)$.) Fabian
proved that under very general conditions on $V_j$ the iteration
method converges with probability 1.
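
The step selection just described can be sketched as below. This is only an illustration under assumptions not stated in the text: the next iterate is taken to be $x_n + a_n y_n$, the search stops at the first observation whose sign differs from that of $V_1$, and `max_j` is a safety cap introduced here.

```python
def fabian_step(observe_m, x, y, a, max_j=1000):
    """Return a_n = j*a, where V_1..V_j share a sign and V_{j+1} does not.

    observe_m(z) is a (possibly noisy) observation of M(z); V_j is taken
    at z = x + j*a*y for j = 1, 2, ...
    """
    def sign(v):
        return (v > 0) - (v < 0)

    first = sign(observe_m(x + a * y))
    j = 1
    # keep extending the step while the next observation keeps the sign
    while j < max_j and sign(observe_m(x + (j + 1) * a * y)) == first:
        j += 1
    return j * a


if __name__ == "__main__":
    # noise-free illustration: M(z) = z - 3, current point 0, direction 3
    step = fabian_step(lambda z: z - 3.0, 0.0, 3.0, 0.3)
    print(step, 0.0 + step * 3.0)
```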

Authors who are interested in the practical or experimental
aspects of stochastic approximation have suggested
that the approximation method be carried out in two stages.
The first stage would take large steps to estimate the
region of interest, while the second stage would take
progressively smaller steps and represents the fine tuning
stage. (See Davis [Ref. 22], Wetherill [Ref. 130], and
Goodman, Lewis and Robbins [Ref. 58].)

L.   CONFIDENCE  INTERVALS  AND  STOPPING  TIMES 

After k iterations it may be desired to obtain an estimate
of $\gamma$ and d such that

$$P(|x_{k+1} - \theta| \le d) \ge 1 - 2\gamma.$$

Farrell [Ref. 50] did some of the first work on confidence
intervals of bounded length but required a priori knowledge
of a bounded interval containing $\theta$.

The subject of stopping times of a non-parametric nature
is an almost untouched area. Farrell stated that Mrs. Nancy
Tapper, Cornell University, had been studying closed stopping
rules and bounded length confidence interval procedures for
the median of a distribution function. However, very little
has appeared in the stochastic approximation literature
concerning stopping rules.

The most general discussion of stopping times based on
the asymptotic normality result was recently presented
by Sielken [Ref. 104] and is stated below.

Using the definition

$$Z(x) = Y(x) - M(x),$$

consider the following conditions:

(1)  $\gamma$ is a positive constant less than 1/2.

(2)  The sequence $\{c_n\}$ of positive constants
is such that $c_n n^{\gamma} \to c$ as $n \to \infty$, for
some $0 < c < \infty$.

(3)  The sequence $\{a_n\}$ has the form $A n^{-1}$,
where A is a constant such that $2A\alpha_1 > 1$.

(4)  M is a Borel-measurable function.

(5)  For every $\varepsilon > 0$,
$\inf_{\varepsilon < x - \theta < \varepsilon^{-1}} (M(x) - \alpha) > 0$
and
$\sup_{\varepsilon < \theta - x < \varepsilon^{-1}} (M(x) - \alpha) < 0$.

(6)  For some constants $K_1$ and $K_2$,
$|M(x) - \alpha| \le K_1 + K_2 |x - \theta|$ for all x.

(7)  $\sup_x E[|Z(x)|^2] < \infty$.

(8)  $\lim_{x \to \theta} E[|Z(x)|^2] = E[Z(\theta)^2] = \sigma^2 > 0$.

(9)  $\lim_{R \to \infty} \lim_{\varepsilon \to 0} \sup_{|x - \theta| < \varepsilon} \int_{|Z(x)| > R} |Z(x)|^2 \, dP = 0$.

(10)  For some positive constants g and $\alpha_1$,
if $|x - \theta| < g$,
then $M(x) = \alpha + \alpha_1 (x - \theta) + \delta(x)$,
where $\delta(x) = o(|x - \theta|)$ as $|x - \theta| \to 0$.

(11)  The distribution function of Y(x),
denoted F(Y|x), is such that for every
y, $F(y|\cdot)$ is Borel-measurable.

(12)  There exists $\varepsilon > 0$ such that for every
positive integer r,
$\sup_{|x - \theta| < \varepsilon} E[|Z(x)|^r] < \infty$.

Then assuming that a $100(1 - 2\gamma)\%$ confidence interval on
$\theta$ of length 2d is desired, the proposed stopping time for
the R-M process is denoted $N_{d,\gamma,1}$, where $N_{d,\gamma,1}$ is the
smallest positive integer, n, such that

$$n \ge \frac{K_\gamma^2 A^2 S_{n,1}^2}{(2A\,T_{n,1} - 1)\, d^2}.$$

The principal results of Sielken are:

THEOREM   [Ref. 104]

If conditions (1) - (12) above are satisfied, then

$$\lim_{d \to 0} \; N_{d,\gamma,1} \Big/ \left[ \frac{K_\gamma^2 A^2 \sigma^2}{(2A\alpha_1 - 1)\, d^2} \right] = 1$$

with probability 1,
and

$$\lim_{d \to 0} P\left( |x_{N_{d,\gamma,1}} - \theta| \le d \right) = 1 - 2\gamma.$$

Sielken has stated that the limit in the theorem can be
interpreted as either:

a.  The level of the sequentially determined
bounded length confidence interval converges
to the prescribed level, $1 - 2\gamma$, as
the desired length, 2d, converges to zero;
or

b.  The probability that the error in the final
estimate of $\theta$ is less than or equal to d
converges to the prescribed probability,
$1 - 2\gamma$, as $d \to 0$.
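
The first limit in the theorem gives a rough idea of how many iterations a prescribed interval requires. The sketch below evaluates the normalizing quantity $K_\gamma^2 A^2 \sigma^2 / ((2A\alpha_1 - 1) d^2)$; it assumes that $K_\gamma$ is the upper $\gamma$ point of the standard normal distribution, which is the usual convention but is not spelled out above, and the function name and the numerical values are illustrative only.

```python
import math
from statistics import NormalDist


def asymptotic_stopping_size(gamma, A, sigma2, alpha1, d):
    """Evaluate K_gamma^2 * A^2 * sigma^2 / ((2*A*alpha1 - 1) * d^2).

    K_gamma is taken as the standard normal quantile of order 1 - gamma
    (an assumption about the notation).
    """
    if not 0.0 < gamma < 0.5:
        raise ValueError("condition (1) requires 0 < gamma < 1/2")
    if 2.0 * A * alpha1 <= 1.0:
        raise ValueError("condition (3) requires 2*A*alpha1 > 1")
    k_gamma = NormalDist().inv_cdf(1.0 - gamma)
    return k_gamma**2 * A**2 * sigma2 / ((2.0 * A * alpha1 - 1.0) * d**2)


if __name__ == "__main__":
    # 95% interval (gamma = 0.025) of half-length d = 0.1
    n = asymptotic_stopping_size(0.025, A=1.0, sigma2=1.0, alpha1=1.0, d=0.1)
    print(math.ceil(n))
```

Halving d multiplies the required number of iterations by four, as expected from the $d^{-2}$ dependence.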

M.   DYNAMIC  STOCHASTIC  APPROXIMATION 

Fabian [Ref. 39] and Dupac [Ref. 34] have considered the
case where the desired level, $\theta$, changes during the iteration
process. The following discussion is by Fu [Ref. 53], based
on Dupac's presentation.

Let $M_n(x) = M(x - \theta_n + \theta_1)$, such that $\theta_n$ is the unique
root of $M_n(x) = 0$. Let $\{a_n\}$ be a sequence of positive
numbers, and let $x_1$ be an arbitrary random variable.
Define

$$x_{n+1} = x_n' - a_n Y(x_n'),$$

where

$$E[Y(x_n') \mid x_1, \ldots, x_n] = M_{n+1}(x_n'),$$

and

$$\operatorname{var}[Y(x_n') \mid x_1, \ldots, x_n] = \sigma^2 < +\infty.$$

The meaning of the above algorithm for computing $x_{n+1}$ with
the modified $x_n$, i.e. $x_n'$, is that when we get an estimate,
$x_n$, of $\theta_n$, we make a correction for trend to obtain $x_n'$ before
computing $x_{n+1}$. It will be seen from the following theorem that
the use of this modified algorithm is justified when $\theta_n$ is
a linear (or nearly linear) function of n.

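THEOREM   [Ref. 34]

Assume that the following conditions are satisfied:

(i)    $M(x) < 0$ for $x < \theta_1$, and $M(x) > 0$ for $x > \theta_1$.

(ii)   There exist $K_0$, $K_1$ such that
$K_0 |x - \theta_1| \le |M(x)| \le K_1 |x - \theta_1|$ for all x.

(iii)  $a_n = a / n^{\alpha}$, for $a > 0$, $1/2 < \alpha < 1$.

(iv)   $\theta_n$ varies in such a way that
$\theta_{n+1} - (1 + n^{-1})\theta_n = O(n^{-\omega})$ for $\omega > \alpha$.

(v)    $E(x_1^2) < +\infty$.

Then $(x_n - \theta_n)$ approaches zero in the mean and

$$E[(x_n - \theta_n)^2] = \begin{cases} O(n^{-\alpha}) & \text{for } 1/2 < \alpha \le 2/3, \\ O(n^{-2(1-\alpha)}) & \text{for } 2/3 < \alpha < 1. \end{cases}$$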


The mean square convergence, as well as convergence with
probability 1, can be deduced from Dvoretzky's theorem, even
under slightly more general conditions on $\theta_n$. A similar
modification to the Kiefer-Wolfowitz procedure is indicated
to solve for a moving maximum of a regression function.

An interesting algorithm is presented in Fu's book
[Ref. 53] for learning of slowly time varying parameters
using dynamic stochastic approximation. Here Kesten's
accelerated scheme [Ref. 69] is coupled with Dupac's dynamic
process to improve the rate of convergence.
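
The effect of the trend correction can be illustrated with a short sketch. The text does not give the formula for $x_n'$; the code assumes $x_n' = (1 + 1/n) x_n$, which is the correction suggested by condition (iv), and uses a root moving linearly, $\theta_n = s n$, with the assumed response $M_n(x) = x - \theta_n$. The names and parameter values are illustrative.

```python
def dynamic_rm(observe, x1, a, alpha, n_steps, trend_correction=True):
    """Dynamic Robbins-Monro iteration x_{n+1} = x_n' - a_n * Y(x_n').

    observe(n, x) returns an observation whose mean is M_{n+1}(x).
    a_n = a / n**alpha.  x_n' = (1 + 1/n) * x_n is an assumed form of
    the trend correction; with trend_correction=False, x_n' = x_n.
    Returns the estimate of theta_{n_steps + 1}.
    """
    x = x1
    for n in range(1, n_steps + 1):
        x_corr = (1.0 + 1.0 / n) * x if trend_correction else x
        x = x_corr - (a / n**alpha) * observe(n, x_corr)
    return x


if __name__ == "__main__":
    slope = 0.7                                   # theta_n = slope * n

    def observe(n, x):
        # noise-free observation of M_{n+1}(x) = x - theta_{n+1}
        return x - slope * (n + 1)

    n_steps = 5000
    target = slope * (n_steps + 1)
    print(dynamic_rm(observe, 0.0, 0.5, 0.6, n_steps) - target)
    print(dynamic_rm(observe, 0.0, 0.5, 0.6, n_steps,
                     trend_correction=False) - target)
```

Without the correction the estimate lags further and further behind the moving root, because the steps $a_n$ shrink while the root keeps moving at a constant rate.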

N.   CONTINUOUS  STOCHASTIC  APPROXIMATION 

In order to obtain a continuous version of the stochastic
approximation method, one can replace the difference recursive
relation in the discrete case with a stochastic differential
equation. Again letting the desired level of response, $\alpha$,
be equal to zero, one obtains the general expression

$$\frac{d}{dt} X(t) = -a(t)\, Y(t, X(t)),$$

where a(t) satisfies the conditions

$$\int_0^\infty a(t)\, dt = \infty \quad \text{and} \quad \int_0^\infty a^2(t)\, dt < \infty.$$

The above relation determines a continuous process for
stochastic approximation of the solution to the equation
M(x) = 0. Driml and Nedoma [Ref. 31] proved that the process
converges when Y(t,x) is monotonic in x and when Y(t,x) is
of the form Y(t,x) = M(x) + h(t), where h(t) is an ergodic
process with zero mean. In both cases the function X(t)
approaches the desired value, $\theta$, with probability 1 as
$t \to \infty$. In the proof by Driml and Nedoma, a(t) is specified
separately for $0 < t \le 1$ and for $t > 1$.
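
A numerical illustration of the continuous procedure can be obtained by integrating the differential equation with small Euler steps. Everything specific in the sketch below is an assumption made for the example: the gain a(t) = 1/(1 + t), which satisfies the two integral conditions, the response M(x) = x - theta, and the deterministic disturbance h(t) = sin(5t), used here as a simple zero-mean stand-in for an ergodic process.

```python
import math


def continuous_rm(y, x0, t_end, dt=0.01, gain=lambda t: 1.0 / (1.0 + t)):
    """Euler integration of dX/dt = -a(t) * Y(t, X(t)) on [0, t_end].

    y(t, x) is the observed response; gain(t) plays the role of a(t).
    """
    x, t = x0, 0.0
    for _ in range(int(round(t_end / dt))):
        x -= dt * gain(t) * y(t, x)
        t += dt
    return x


if __name__ == "__main__":
    theta = 3.0
    # Y(t, x) = M(x) + h(t) with M(x) = x - theta and h(t) = sin(5 t)
    y = lambda t, x: (x - theta) + math.sin(5.0 * t)
    for t_end in (10.0, 50.0, 200.0):
        print(t_end, continuous_rm(y, 7.0, t_end))
```

For this choice the error decays roughly like 1/(1 + t), so the printed values approach theta slowly but steadily.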

O.   EXTENSIONS OF CONTINUOUS STOCHASTIC APPROXIMATION

As was experienced in the discrete case, the one dimensional
continuous case can be extended to the multidimensional
case. However, many theorems which are valid for the one
dimensional case are not valid for the multidimensional case,
which depends heavily on stationary point theorems. (I.e.
theorems concerning a point $x_0$ of some space X for which
$F(x_0) = x_0$, where F maps X into X.) For a discussion of these
theorems see Driml and Hans [Ref. 30] and Hans and Spacek
[Ref. 61].

One representation using continuous stochastic approximation
is by Kitagawa [Ref. 71], who formulated a Robbins-Monro
model where the Brownian motion process is used to represent
the random disturbances inherent in the observations.


IV.   FINDING THE MAXIMUM OF AN UNKNOWN REGRESSION
FUNCTION:   THE KIEFER-WOLFOWITZ METHOD

A problem of practical importance with a regression
function, Y(x), is to estimate the value of x, say $\theta$, at
which the expectation of Y(x), denoted M(x), is a maximum.
To intuitively introduce the method consider the following
argument from Wetherill [Ref. 131].


[Figure 1: three panels, (a), (b) and (c), showing Y(x) against x with the observation points $x_1$ and $x_2$ marked.]


Suppose two observations, $y(x_1)$ and $y(x_2)$, are taken at
values $x_1$ and $x_2$ where $x_1 < x_2$. Then

(a)  If $y(x_1) < y(x_2)$ one expects the maximum
level, $\theta$, to be at a value $\ge x_2$.

(b)  If $y(x_1) > y(x_2)$ one expects the maximum
level, $\theta$, to be at a value $\le x_1$.

(c)  If $y(x_1)$ is about equal to $y(x_2)$ more
observations are necessary to determine the
region of interest.

Thus it would be reasonable to take further observations
in the direction indicated by the slope of the two Y values,
and the distance moved along the x-axis, before taking
further observations, should be proportional to the difference
between $y(x_1)$ and $y(x_2)$. Using this basic idea and the
initial results of Robbins and Monro, Kiefer and Wolfowitz
[Ref. 70] defined the following procedure for stochastic
approximation of the maximum of a regression function.

THEOREM   [Ref. 131]

Let M(x) be a regression function and F(y|x) a family
of distribution functions, and assume that the following
conditions are satisfied:

$$\int (Y(x) - M(x))^2 \, dF(Y|x) \le \sigma^2 < +\infty,$$

and assume that M(x) is strictly increasing for $x < \theta$, and
that M(x) is strictly decreasing for $x > \theta$.
Let $\{a_n\}$ and $\{c_n\}$ be infinite sequences of positive
real numbers such that

$$\sum_{n=1}^{\infty} a_n = \infty, \qquad \sum_{n=1}^{\infty} a_n c_n < \infty, \qquad \sum_{n=1}^{\infty} a_n^2 c_n^{-2} < \infty$$

(for example: $a_n = n^{-1}$ and $c_n = n^{-1/3}$). Then the recursive
scheme defined by

$$x_{n+1} = x_n + \frac{a_n}{c_n} \left[ Y(x_n + c_n) - Y(x_n - c_n) \right]$$

converges in probability to the location, $\theta$, of the maximum
of the regression function M(x) if three regularity conditions
are satisfied. They are listed here with their intuitive meanings.

Condition 1.*   There exist positive $\beta$ and B such that
$|x_1 - \theta| + |x_2 - \theta| < \beta$ implies $|M(x_1) - M(x_2)| < B |x_1 - x_2|$
for all $x_1, x_2$. This says if the function, M(x), has
a derivative, it must be zero when $x = \theta$; as a result
the derivative must be bounded in the neighborhood of
$\theta$.

Condition 2.    There exist positive $\rho$ and R such that
$|x_1 - x_2| < \rho$ implies $|M(x_1) - M(x_2)| < R$. In other
words, if M(x) increases too abruptly in certain regions,
there exists a positive probability that it may reach
$+\infty$ or $-\infty$; as a result, the Lipschitz condition must
be satisfied.

Condition 3.    For every $\delta > 0$, there exists a positive
$\pi(\delta)$ such that $|x - \theta| > \delta$ implies

$$\inf_{\delta/2 > \varepsilon > 0} \frac{|M(x + \varepsilon) - M(x - \varepsilon)|}{\varepsilon} > \pi(\delta).$$

That is, for a very flat function the rate of motion toward
$\theta$ is small. As a result, the absolute value of the derivative
must be bounded below.


*   As Blum later proved [Ref. 8], the above theorem
holds even when Condition 1 is not satisfied.

While these regularity conditions seem restrictive, it
is only necessary that they hold in an interval $[c_1, c_2]$
where it is known a priori that $c_1 \le \theta \le c_2$. Suppose, however,
that some proposed level, $x_n \pm c_n$, lies outside the interval
$[c_1, c_2]$ and one cannot take an observation at that level.
If one then moves $x_n$ so that the offending $x_n \pm c_n$ is at
the boundary ($c_1$ or $c_2$) we may proceed as directed and the
conclusions remain valid.
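
The procedure, including the boundary adjustment just described, can be sketched as follows. This is an illustration under stated assumptions rather than a definitive implementation: the sequences are taken as $a_n = a/n$ and $c_n = c\,n^{-1/3}$ (the example sequences of the theorem, scaled by the illustrative multipliers `a` and `c`), the optional `bounds` argument stands for the interval $[c_1, c_2]$ and is assumed to be longer than $2c_n$, and the test response and noise level are invented for the example.

```python
import random


def kiefer_wolfowitz(observe, x1, n_steps, a=1.0, c=1.0, bounds=None):
    """Iterate x_{n+1} = x_n + (a_n / c_n) * [Y(x_n + c_n) - Y(x_n - c_n)].

    a_n = a / n and c_n = c * n**(-1/3).  If bounds = (lo, hi) is given,
    x_n is first moved so that both x_n - c_n and x_n + c_n lie in the
    interval (assumes hi - lo >= 2 * c_n).
    """
    x = x1
    for n in range(1, n_steps + 1):
        a_n = a / n
        c_n = c * n ** (-1.0 / 3.0)
        if bounds is not None:
            lo, hi = bounds
            x = min(max(x, lo + c_n), hi - c_n)
        x = x + (a_n / c_n) * (observe(x + c_n) - observe(x - c_n))
    return x


if __name__ == "__main__":
    theta = 2.0
    rng = random.Random(1)
    # noisy observations of M(x) = -(x - theta)^2 (assumed response)
    noisy = lambda x: -(x - theta) ** 2 + rng.gauss(0.0, 0.1)
    print(kiefer_wolfowitz(noisy, 0.0, 2000, a=0.5, bounds=(0.0, 5.0)))
```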

A.  CONSTANT  COEFFICIENTS 

Burkholder [Ref. 12] proved that under certain conditions,
the Kiefer-Wolfowitz procedure can still be used if $c_n$ is
held constant for all n at a particular value, $c_0$. $x_n$ is
then asymptotically normally distributed with variance
proportional to $n^{-1}$. This result is difficult to use in
practice since there will rarely be enough information about
the response curve to choose $c_0$ as required by Burkholder.

B.  CONVERGENCE  WITH  PROBABILITY  1 

The Kiefer-Wolfowitz process is a special case of the
Dvoretzky process. (I.e. the process can be written as the
sum of a deterministic term and an error term.) This can
be seen by writing

$$x_{n+1} = x_n + \frac{a_n}{c_n} \left[ M(x_n + c_n) - M(x_n - c_n) \right] + z_n,$$

where the error term is

$$z_n = \frac{a_n}{c_n} \left[ Y(x_n + c_n) - M(x_n + c_n) - Y(x_n - c_n) + M(x_n - c_n) \right].$$

It follows from a theorem by Dvoretzky that the Kiefer-Wolfowitz
procedure converges with probability 1 and in
mean square under conditions weaker than those imposed by
Kiefer and Wolfowitz. Burkholder [Ref. 12] also proved
convergence with probability 1 using a somewhat different
approach. Later Venter [Ref. 122] showed that the K-W
method converges almost surely to the maximum if this is the
only stationary point of the surface and some other conditions
are satisfied. This result is stronger, in a sense,
than those existing previously.

C.   MULTIDIMENSIONAL  KIEFER-WOLFOWITZ 

Let $(X_1, \ldots, X_N)$ be a family of random variables; let
$F_{x_1, \ldots, x_N}$ be the corresponding distribution function;
and let $M(x_1, \ldots, x_N)$ be the corresponding regression function.
We then desire to find a vector $x = \theta$ for which the regression
function is maximal. Assume that M(x) has a unique
maximum at the point $x = \theta$.

Blum [Ref. 7] constructed a multidimensional K-W process
in the following manner. Let $x \in R_N$ and let $(e_1, \ldots, e_N)$ be
an orthonormal basis in $R_N$. Then for some real $c > 0$, we
make N + 1 observations of the random variable $Y(\cdot)$,

$$Y(x),\; Y(x + c e_1),\; Y(x + c e_2),\; \ldots,\; Y(x + c e_N),$$

and consider the vector

$$Y_{x,c} = \left[ \{Y(x + c e_1) - Y(x)\}, \ldots, \{Y(x + c e_N) - Y(x)\} \right].$$

Then beginning with some arbitrary vector, $x_1$, construct the
sequence

$$x_{n+1} = x_n + \frac{a_n}{c_n} Y(x_n),$$

where $Y(x_n)$ denotes $Y_{x_n, c_n}$.

Denote the vector of first derivatives of M(x) by D(x), and
the matrix of second derivatives by A(x). Then the following
theorem by Blum is presented:

THEOREM   [Ref. 7]

Let $\{a_n\}$ and $\{c_n\}$ be sequences of positive real
numbers that satisfy:

$$\sum_{n=1}^{\infty} a_n = \infty, \qquad \sum_{n=1}^{\infty} a_n c_n < \infty \qquad \text{and} \qquad \sum_{n=1}^{\infty} a_n^2 c_n^{-2} < \infty.$$

Moreover assume that Y(x) and M(x) are such that

$$E[Y(x)^2] \le \sigma^2 < \infty,$$

$M(\cdot)$ is continuous together with its first and second
derivatives, and for any $\varepsilon > 0$, there exists a $\rho(\varepsilon) > 0$
such that

$\|x\| \ge \varepsilon$ implies that

$M(x) \le -\rho(\varepsilon)$, and

$\|D(x)\| \ge \rho(\varepsilon)$,

where the partial derivatives $\partial^2 M(x) / \partial x_i \partial x_j$ are bounded
for all i, j = 1, ..., N.

Then the sequence $\{x_n\}$ as previously defined converges
to $\theta = 0$ with probability 1. Note that each step in Blum's
algorithm requires N + 1 observations. Gray [Ref. 59]
proved that the multidimensional K-W process defined by

$$x_{n+1} = x_n + \frac{a_n}{c_n} \left[ Y^{+}_{x_n, c_n} - Y^{-}_{x_n, c_n} \right]$$

also converges with probability one, where

$$Y^{+}_{x_n, c_n} = \{ Y(x_n + c_n e_1), \ldots, Y(x_n + c_n e_N) \},$$

$$Y^{-}_{x_n, c_n} = \{ Y(x_n - c_n e_1), \ldots, Y(x_n - c_n e_N) \},$$

which requires 2N observations in each step.
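
The two multidimensional schemes differ only in how the difference vector is formed, as the following sketch shows. It is an illustration under assumptions: the sequences are again taken as $a_n = a/n$ and $c_n = c\,n^{-1/3}$, the standard basis is used for $(e_1, \ldots, e_N)$, and the quadratic test response is invented for the example.

```python
def _shift(x, i, h):
    """Copy of the point x with coordinate i moved by h (x + h * e_i)."""
    z = list(x)
    z[i] += h
    return z


def kw_multidim(observe, x1, n_steps, a=0.5, c=1.0, scheme="gray"):
    """Multidimensional K-W iteration with a_n = a/n, c_n = c*n**(-1/3).

    scheme="blum": one-sided differences Y(x + c e_i) - Y(x),
                   N + 1 observations per step.
    scheme="gray": two-sided differences Y(x + c e_i) - Y(x - c e_i),
                   2N observations per step.
    """
    x = list(x1)
    dim = len(x)
    for n in range(1, n_steps + 1):
        a_n = a / n
        c_n = c * n ** (-1.0 / 3.0)
        if scheme == "blum":
            y0 = observe(x)
            diff = [observe(_shift(x, i, c_n)) - y0 for i in range(dim)]
        else:
            diff = [observe(_shift(x, i, c_n)) - observe(_shift(x, i, -c_n))
                    for i in range(dim)]
        x = [xi + (a_n / c_n) * di for xi, di in zip(x, diff)]
    return x


if __name__ == "__main__":
    theta = [1.0, -2.0]
    m = lambda x: -sum((xi - ti) ** 2 for xi, ti in zip(x, theta))
    print(kw_multidim(m, [0.0, 0.0], 8000, scheme="blum"))
    print(kw_multidim(m, [0.0, 0.0], 50, scheme="gray"))
```

For this noise-free quadratic the two-sided differences are exact, while the one-sided differences carry a bias of order $a_n c_n$ per step that dies out only slowly.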
D.   ASYMPTOTIC PROPERTIES OF K-W PROCESS

The first results concerning asymptotic properties of
the Kiefer-Wolfowitz process were obtained by Derman [Ref. 24]
and Dupac [Ref. 32], based on the lemmas of Chung [Ref. 14].
Sacks [Ref. 95] has discussed conditions for asymptotic
normality of $x_n$. If $c_n$ is chosen to tend to zero, then the
asymptotic variance of $x_n$ can never be made as small, in
order of magnitude, as Burkholder's result of being proportional
to $n^{-1}$ with $c_n = c_0$ a constant. The most general
results without a priori assumptions about the length of
the interval containing the point $x = \theta$ have been obtained
by Sacks.

THEOREM   [Ref. 95]

Let M(x) be a measurable function with a unique maximum
at $x = \theta$, and assume that this function satisfies the
conditions:

(i)     $\inf (x - \theta)(M(x - \varepsilon) - M(x + \varepsilon)) > 0$, the infimum
being taken over $\varepsilon_1 \le |x - \theta| \le \varepsilon_2$ and $0 < \varepsilon < \varepsilon_0$,
where $0 < \varepsilon_0 < \varepsilon_1 < \varepsilon_2 < \infty$;

(ii)    for all x, $M(x) = \alpha_0 - \alpha_1 (x - \theta)^2 + \delta(x, \theta)$,
where $\alpha_1 > 0$ and $\delta(x, \theta) = o(|x - \theta|^2)$
as $|x - \theta| \to 0$;

(iii)   for some $c_0 > 0$, there exist positive
constants $K_1$ and $K_2$ such that for all x
and all c for which $0 < c \le c_0$,
$K_1 (x - \theta)^2 \le (x - \theta)[M(x - c) - M(x + c)]\, c^{-1} \le K_2 (x - \theta)^2$;

(iv)    for every $\varepsilon > 0$ there exists a $c_\varepsilon > 0$ such
that for all c satisfying $0 < c \le c_\varepsilon$ and
all x satisfying $|x - \theta| < c$,
$|\delta(x - c, \theta) - \delta(x + c, \theta)|\, c^{-1} \le \varepsilon |x - \theta|$.

Further assume

$$\lim_{x \to \theta} E[\{Y(x) - M(x)\}^2] = \sigma^2 / 2$$

and

$$\lim_{R \to \infty} \lim_{\varepsilon \to 0} \sup_{|x - \theta| < \varepsilon} \int_{|Y(x) - M(x)| > R} (Y(x) - M(x))^2 \, dP = 0.$$

Then if $a_n = A n^{-1}$, where $A > 1/(2K_1)$, the random variable
$n^{1/2} c_n (x_n - \theta)$ is asymptotically normally distributed with

Mean = 0

Variance = $\sigma^2 A^2 (8\alpha_1 A - 1)^{-1}$

Sacks,  in  the  same  paper,  also  gave  the  similar  asymptotic 
limiting  distribution  for  the  multidimensional  K-W  process. 

E.   MAXIMUM  SAMPLE  EXCURSIONS  IN  KIEFER-WOLFOWITZ  PROCESS 

When we seek a maximum or minimum using the Kiefer-Wolfowitz
process, the possibility arises that we may be
working with a function with more than one local maximum or
that we do not want to reduce the performance, M(x), below
some minimum level. The value of x corresponding to this
level may not be known. In both of these cases we may wish
to limit the excursions to some given multiple or function of
$|x_1 - \theta|$, with a high probability, while still being certain
that $x_n \to \theta$ with probability 1. To accommodate this
situation Kushner [Ref. 79] presented estimates of the
following form:

For any $m < \infty$ and even integer r,

$$P\left[ \max_{m \ge n \ge N} |x_n - \theta| \ge \varepsilon \right] \le \left[ E(x_N - \theta)^r + \delta_{N,r} \right] / \varepsilon^r,$$

where $\delta_{N,r}$ depends on the sequences $a_n$ and $c_n$ and can be
made arbitrarily small for each fixed N and r, while
$x_n \to \theta$ with probability 1 is still ensured.

F.   ACCELERATED  CONVERGENCE  FOR  THE  K-W  PROCESS 

As in the case of the Robbins-Monro process, the rate
of convergence of the K-W process can be increased by using
Kesten's algorithm [Ref. 69] (see Sec. III.K). Another
method for accelerating convergence was proposed by Fabian
[Ref. 40], who later showed [Ref. 45] that the multidimensional
K-W procedure for functions, f, sufficiently smooth at $\theta$,
the point of minimum (or maximum), can be modified in such
a way as to be almost as speedy as the R-M method. This
modification consists of making more observations at every
step and of utilizing these to eliminate the effect of all
derivatives $\partial^j f / \partial x^j$, j = 3, 5, 7, ..., s-1. Let $\delta_n$ be the
distance from the approximated $\theta$ after n observations. Under
similar conditions on f as those used by Dupac [Ref. 32] the
result $E\delta_n^2 = O(n^{-s/(s+1)})$ can be obtained. Under weaker
conditions it was proved that $\delta_n^2 \, n^{s/(s+1) - \varepsilon} \to 0$ with
probability 1 for every $\varepsilon > 0$.

In a follow-up paper Fabian [Ref. 46] noted that there
are many designs, d, which achieve the speed of $E\delta_n^2$ as
stated above. He derived the dependence on d of

$$\lim_{n \to \infty} n^{s/(s+1)} E\delta_n^2$$

so that one may choose the design which minimizes $E\delta_n^2$.

In yet a third paper in this series, Fabian [Ref. 48]
utilized a design which minimizes $E\delta_n^2$
and achieved the result


E|  |x  -  0|  I^  =  o(t  "-^log^t  ) 
'  '  n    '  '       n    ^  n 


where $t_n$ equals the number of observations necessary to
construct $x_1, x_2, \ldots, x_n$.

G.   THE  CONTINUOUS  KIEFER-WOLFOWITZ  PROCESS 

As with the Robbins-Monro method we have a continuous
analog of the Kiefer-Wolfowitz method. Let us consider a
method, as discussed in Loginov's survey [Ref. 81], for an
ergodic random process $Y_t$. Let x denote an N-dimensional
vector with coordinates $x_1, \ldots, x_N$ in N-dimensional Euclidean
space with orthonormal basis $e_1, \ldots, e_N$. Then the regression
function is $M(x) = E[Y_t(x)]$. Moreover assume that

$$y_{i,t}[x, c(t)] = y_t[x + c(t) e_i] - y_t[x - c(t) e_i],$$

where c(t) is some positive function. Then the continuous
K-W method of determining a minimum point for a regression
function is described by the equation

$$\frac{dx_{i,t}}{dt} = -a(t)\, I_{i,t}\, c^{-1}(t)\, y_{i,t}[x_t, c(t)]$$

with initial conditions $x_{i,0} = x_i(0)$, for i = 1, 2, ..., N,

where

$$I_{i,t} = 1 - G_i^{+}(x_{i,t})\, F^{+}(y_{i,t}) - G_i^{-}(x_{i,t})\, F^{-}(y_{i,t}).$$

Here $G_i^{+}$ is a monotonic function with derivative bounded on
$[b_i - \delta, b_i]$ and

$$G_i^{+}(x) = \begin{cases} 0 & \text{for } x \le b_i - \delta, \\ 1 & \text{for } x = b_i, \end{cases}$$

and $G_i^{-}$ is a monotonic function with derivative bounded on
$[a_i, a_i + \delta]$ and

$$G_i^{-}(x) = \begin{cases} 0 & \text{for } x \ge a_i + \delta, \\ 1 & \text{for } x = a_i, \end{cases}$$

and $F^{+}(y) = 1 - y / \varepsilon c(t)$ and $F^{-}(y) = 1 + y / \varepsilon c(t)$ for $\varepsilon > 0$.

The essential difference between the original discrete
Kiefer-Wolfowitz method and this continuous version is the
fact that here the observations need not be independent as
they were in the discrete case. The term $I_{i,t}$ serves to
limit the variable $x_i$ to the interval $[a_i, b_i]$.

Sakrison proved the following convergence theorem for
the continuous K-W process.

THEOREM   [Ref. 92]

Represent $y_t$ in the form

$$y_t(x) = \sum_{j=1}^{N} g_j(x)\, V_{j,t},$$

where $V_{j,t}$ are ergodic random processes that are bounded
with probability one, while $g_j(\cdot)$ are functions whose second
partial derivatives with respect to $x_i$ are bounded.

Now  let  D,    denote  any  of  the  random  processes  V   , 

or  V  ^.,  V   ,    (e,  m  =  lj2,...,N).   Moreover  let  F,  be  ar  / 

bounded  functional  defined  on  the  processes  V     (t  <  t) 

^  ejT    — 

and  Bj5,j^(p)  =  M{(F^  -  M(F^))(D^^p  -  M(D^^p))}   be  such  that 


|Bfd(p)|  i  OpOpCK^/p^) 


where $K_0 < +\infty$. Assume that the regression function, M(x),
satisfies the conditions

$$(\operatorname{grad} M(z)|_{z=x},\; x - \theta) \ge K_1 \|x - \theta\|^2,$$

$$\|\operatorname{grad} M(x)\| \le K_2 \|x - \theta\|$$

for $0 < K_1 < K_2 < +\infty$,

$$\left| \frac{\partial^2 M}{\partial x_i^2} \right| < P$$

for i = 1, 2, ..., N. Then if the relations

00 

/   a(t)dt  =  « 
o 

00 

/   a(t)  c^(t)dt  <  °° 
o 

oo 

/   a(t)  a(l/2t)dt  <  °° 
o 

CO 

/   a(t)  c--^(t)dt  <  « 
o 

hold for the functions a(t) and c(t), the solution of the
stochastic differential equation converges to $\theta$ in mean
square, i.e.

$$\lim_{t \to \infty} E\{ \|x_t - \theta\|^2 \} = 0.$$

An  example  of  functions  satisfying  the  above  conditions 


are 
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a(t)  =  — - —      and    c(t)  =  — - — 
(t  +  1)"         ■         (t  +  l)"^ 

whe  re 

« 

i^  <  a  <_  1         and    y  >  ^$(1  -  a). 

Example   (by  Sakrison  [Ref.  92]). 
If  a  =  1   and  y   >  ^, 


then 


E{||x^  -  e||^}  =  0(l/n^^) 


It is not difficult to see that in the continuous case
the requirements of the theorems are considerably more
stringent than those in the discrete case. Here constraints
are imposed on the process itself, not just on the regression
function. This is the fundamental difference between
discrete and continuous stochastic approximation methods.

V.   PRACTICAL ASPECTS

A.   CHOICE OF $a_n$

In Section III.J a theorem of Dvoretzky [Ref. 36] was
presented giving a formulation for the sequence, $a_n$, which
is optimal in the minimax sense. However, this formulation
contains parameters which will in general be unknown to the
experimenter. The need then arises for a method of choosing
a sequence.

Hodges and Lehmann [Ref. 64] recommended using coefficients
of the form $a_n = c/n$, where c is chosen to minimize the
asymptotic variance, $\sigma^2 c^2 / (2\alpha_1 c - 1)$. This leads to
choosing $c = 1/\alpha_1$, where $\alpha_1$ is the slope of the response
function, M(x), at the desired level of $x = \theta$. (I.e. choose
$c = 1/\alpha_1$, where $\alpha_1 = M'(\theta)$.) This does not reduce the
experimenter's dilemma since it requires a priori estimation
of another unknown parameter. It does, however, provide a
basis for sensitivity analysis on expected squared error
based on changes in the multiplier, c, in terms of $\alpha_1$.
Computer simulations were performed by Hodges and Lehmann
[Ref. 64] and by Wetherill [Ref. 130] with very similar
results.
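
The choice $c = 1/\alpha_1$ follows from a one-line minimization of the asymptotic variance quoted above. The short derivation below is added for convenience; $V(c)$ is a name introduced here for that variance.

```latex
V(c) = \frac{\sigma^2 c^2}{2\alpha_1 c - 1}, \qquad c > \frac{1}{2\alpha_1},
\\[4pt]
\frac{dV}{dc}
  = \sigma^2 \, \frac{2c(2\alpha_1 c - 1) - 2\alpha_1 c^2}{(2\alpha_1 c - 1)^2}
  = \frac{2\sigma^2 c \,(\alpha_1 c - 1)}{(2\alpha_1 c - 1)^2},
\\[4pt]
\frac{dV}{dc} = 0 \iff c = \frac{1}{\alpha_1},
\qquad
V\!\left(\frac{1}{\alpha_1}\right) = \frac{\sigma^2}{\alpha_1^2}.
```

The derivative is negative for $1/(2\alpha_1) < c < 1/\alpha_1$ and positive for $c > 1/\alpha_1$, so the stationary point is the minimum.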

In general, choosing $c \le 1/(2\alpha_1)$ should be avoided since
the asymptotic behavior is unknown and simulation experiments
indicate that large biases exist when c is chosen to be too
small. Similarly, when c is chosen too large the asymptotic
variance increases; however, it increases slowly for $c\alpha_1 > 1$.
Thus when the value of $\alpha_1$ is unknown it would be more
desirable to overestimate the value of c than to
underestimate c.

In the special case where M(x) is linear it is easily
shown that $a_n = c/n$, with $c = 1/M'(\theta)$, is a desirable
choice. Consider the case M(x) = bx, where it is desired to
sequentially arrive at the value of x where M(x) = 0.
Without loss of generality let $\theta = 0$. Thus the value of
$x = \theta$ for which M(x) = 0 is $\theta = 0$. Choose c = 1/b, noting
that b is the slope of the response function. Then for any
initial value, $x_1$, the expected value of $x_2$ can be easily
computed since

$$x_2 = x_1 - \frac{1}{b} \left( Y(x_1) - 0 \right)$$

implies that

$$E(x_2) = x_1 - \frac{1}{b} E\{Y(x_1)\},$$

where

$$E\{Y(x_1)\} = M(x_1) = b x_1.$$

Hence

$$E(x_2) = x_1 - \frac{1}{b} (b x_1) = 0,$$

the desired $\theta$.

Thus in the linear case the correct choice of c will
move the estimate to the neighborhood of $\theta$ early in the
process, as evidenced by the fact that the first step
actually produces an unbiased estimate.

B.   ESTIMATING  THE  SLOPE  TO  IMPROVE  ASYMPTOTIC  VARIANCE 

It was noted by Wetherill [Ref. 130] that in the simple
case where M(x) is a linear function, and the sequence
$a_n = c/n$ is used, the choice of c is critical to the
efficiency of the process,
where efficiency is defined as the reciprocal of the ratio
of the variance for a given c to the variance at $c = M'(\theta)$.
See Table 1 (also see Hodges and Lehmann [Ref. 64]).

TABLE  1 

Asymptotic  Efficiency  of  the  Robbins-Monro 
Process  as  a  Function  of  c/M'(6) 

c/M'(e)      0.50    0.75    1.00    1.25    1.50    2.00    2.50 
efficiency     0    0.88   1.00   0.96   0.88   0.75   0.64 
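
The tabulated efficiencies can be reproduced from the standard asymptotic variance of the Robbins-Monro process with a_n = c/n, which is proportional to c²/(2u - 1) where u denotes the product of c and the slope M'(θ) and u > 1/2. Dividing the minimum (attained at u = 1) by this variance gives (2u - 1)/u². The sketch below is an illustration under that assumption; the function name is hypothetical.

```python
def rm_efficiency(u):
    """Asymptotic efficiency (2u - 1) / u**2, where u = c * M'(theta)."""
    if u <= 0.5:
        return 0.0  # the asymptotic variance is not finite at or below u = 1/2
    return (2.0 * u - 1.0) / (u * u)

table_u = [0.50, 0.75, 1.00, 1.25, 1.50, 2.00, 2.50]
efficiencies = [round(rm_efficiency(u), 3) for u in table_u]
```

The function falls off steeply to the left of u = 1 and only slowly to the right, which is the asymmetry the table displays.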

Table 1 shows that there is a large range of c for which
the process is very efficient, with c = 1/M'(θ) being optimal.
It also implies that it is better to overestimate the
value of c than to underestimate it.

Burkholder [Ref. 12] discussed the possibility of
estimating the slope of M at θ, but this procedure was not
investigated further until Venter [Ref. 121] presented an
extension of the Robbins-Monro procedure which estimates
the slope of the regression function at the root. The
method is similar to the Kiefer-Wolfowitz procedure in that
at each step two observations are taken, namely
Y'_n = Y(x_n + c_n) and Y''_n = Y(x_n - c_n), where
c_n = c n^{-γ}(1 + o(1)), c > 0, 0 < γ < 1/2. Venter required
that we know constants a and b such that
0 < a < M'(θ) < b < ∞. At each step he estimated the slope
by B_n, where

$$B_n = n^{-1} \sum_{j=1}^{n} \frac{y'_j - y''_j}{2c_j},$$


and then kept the estimated slope within the established
bounds by using A_n as the estimate of the slope, where

$$A_n = \begin{cases} a & \text{if } B_n < a, \\ B_n & \text{otherwise}, \\ b & \text{if } B_n > b. \end{cases}$$

Venter then defined the recursive relation

$$x_{n+1} = x_n - d_n A_n^{-1}\,\tfrac{1}{2}\,(y'_n + y''_n),$$

where

$$d_n = n^{-1}\bigl(1 + O(n^{-1/2})\bigr).$$

Venter showed that if, in the choice of {c_n}, 1/4 < γ < 1/2,
then

$$n^{1/2}(x_{n+1} - \theta) \xrightarrow{\mathcal{L}} N\Bigl(0,\ \frac{\sigma^2}{2(M'(\theta))^2}\Bigr)$$

and

$$n^{1/2-\gamma}(A_n - M'(\theta)) \xrightarrow{\mathcal{L}} N\Bigl(0,\ \frac{\sigma^2}{2c^2(1 + 2\gamma)}\Bigr).$$

However, if γ = 1/4 then

$$n^{1/2}(x_{n+1} - \theta) \xrightarrow{\mathcal{L}} N\Bigl(-\frac{2\alpha_2 c^2}{M'(\theta)},\ \frac{\sigma^2}{2(M'(\theta))^2}\Bigr)$$

and

$$n^{1/4}(A_n - M'(\theta)) \xrightarrow{\mathcal{L}} N\Bigl(0,\ \frac{\sigma^2}{3c^2}\Bigr).$$

Venter stated that in the case of γ < 1/4 the bias in
the estimate, x_{n+1}, of θ will dominate the error. The
choice of γ = 1/4 therefore gives a small negative bias but
decreases the variance in the estimate of the slope.

One might ask whether this modified procedure is actually
at a disadvantage since it requires two observations per
step. Venter showed that after n steps (2n observations)
its variance still achieves the minimum value attained by
the original Robbins-Monro procedure after 2n steps (2n
observations). Venter also provided an estimate of σ² so
that confidence intervals could be constructed for his
procedure.

Fabian [Ref. 47] later provided a sophisticated proof of
asymptotic normality of Venter's procedure and of a similar
procedure applied to the Kiefer-Wolfowitz method.
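
The mechanics of Venter's two-observation scheme can be sketched as follows. This is an illustration only, assuming a linear response with normal errors, the simplest admissible choices c_n = c n^{-γ} and d_n = 1/n, and hypothetical names (venter, observe); it is not a reproduction of Venter's own computations.

```python
import random

def venter(observe, x1, a, b, c, gamma, n_steps):
    """Sketch of the two-observation procedure; returns (x_n, A_n)."""
    x, slope_sum, slope = x1, 0.0, a
    for n in range(1, n_steps + 1):
        c_n = c * n ** (-gamma)          # half-width of the design interval
        y_hi = observe(x + c_n)          # Y'_n
        y_lo = observe(x - c_n)          # Y''_n
        slope_sum += (y_hi - y_lo) / (2.0 * c_n)
        b_n = slope_sum / n              # raw slope estimate B_n
        slope = min(max(b_n, a), b)      # truncated estimate A_n in [a, b]
        d_n = 1.0 / n                    # simplest admissible d_n (assumption)
        x = x - d_n / slope * 0.5 * (y_hi + y_lo)
    return x, slope

# Assumed test problem: M(x) = 2 (x - 1), so theta = 1 and M'(theta) = 2.
rng = random.Random(0)
theta, true_slope = 1.0, 2.0
observe = lambda x: true_slope * (x - theta) + rng.gauss(0.0, 0.5)
x_hat, slope_hat = venter(observe, x1=0.0, a=0.5, b=5.0, c=1.0,
                          gamma=0.3, n_steps=2000)
```

The truncation to [a, b] matters mainly in the first few steps, when B_n is based on very few differences and could otherwise be close to zero or negative.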

C.   SMALL  SAMPLE  THEORY 

Considering the practical applications of stochastic
approximation in experiments where unlimited quantities
of test items are not available, it is reasonable to
ask how small-sample realizations compare with asymptotic
theory. For instance, if an experimenter has fewer than, say,
50 animals with which to determine the LD_50 (Lethal Dose
50%), then he may be concerned with designing a stochastic
approximation method that obtains the "best" possible
results together with an estimate of the expected error.

1.   Choice of x_1:

If one has prior information that θ (for say M(θ) =
0.50) lies in a narrow interval and picks x_1 in that interval,
then one can expect the estimates to arrive in the neighborhood
of θ within a few observations. If, however, there is
little prior knowledge of the magnitude of θ, then a bad
initial choice of x_1 can induce a large bias term which will
dominate the observations for many steps.

2.   Choice of Multiplier, a_n:

As previously discussed, a_n = c/n, where c equals the
inverse of the slope of M(·) at θ, is optimal in a sense.
Thus one must accurately estimate c for optimal conditions.
If c is too small the step sizes may be too small to get to
θ before the supply of samples is depleted. Similarly,
if c is too large the estimate may overshoot θ back and forth.
For a detailed analysis see Section V.A.

3.   How to Allocate Samples:

If an experimenter has N samples to test, should
he test one at each step and take N steps, or test m at each
step and take n = N/m steps? Note that taking more than
one observation at each level, x_i, yields a more accurate
estimate of M(x_i) = E(Y|x_i). It was noted by Wetherill
[Ref. 130] and by Cochran and Davis [Ref. 17], and was proven
by Block [Ref. 5], that the variance of the estimate of θ
depends only on the total number of samples, not on the
sampling scheme; however, the corresponding bias term, and
hence the mean squared error, is affected by the scheme.

Cochran and Davis presented two graphs illustrating
their analysis, which are reproduced here. In their
notation σ is the standard deviation of the observation, Y(x),
at x = θ (which in general will be unknown to us). Also
note the following terminology:

MSE  =  mean squared error;
C_0  =  optimal choice of coefficient, c;
m    =  number of samples taken at each level;
n    =  number of levels or steps;

where nm = N = total number of samples.

Figure 2

Figure 3


Figure 2 implies that if x_1 is relatively unknown it
is more desirable to overestimate c so we are not "trapped"
by a large initial bias and small steps. Figure 3 implies
that if the initial guess, x_1, is more than about 2σ away,
then sampling should be done one at a time, while if the
initial guess is very accurate, then the MSE's are smaller,
although only very slightly so, for larger m. Thus as a general
rule, unless we know that the initial guess is very accurate
or unless the cost of setting up experiments at different
levels is high, sampling should be conducted as one sample
per level.

Another question which the experimenter may ask is how
much more accurate an estimate becomes if he doubles the
number of samples, say from scheme 1: m = 3, n = 8 to
scheme 2: m = 6, n = 8. Doubling the value of m in this
way reduces the variance of the estimate by almost exactly
one-half, but produces only a slight decrease in the bias.
Consequently if V, B, and M are the variance, bias and MSE
for m = 3, n = 8 (scheme 1), then the corresponding MSE for
m = 6, n = 8 (scheme 2) can be predicted by the expression
MSE_2 = B² + V/2 = (B² + M)/2. (This assumes x_1 is the
same for both schemes.) This expression overestimates the
MSE, but at most by only a few percent.
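
The prediction rule for doubling m amounts to a one-line calculation, sketched here with a hypothetical helper name and made-up input values (they are not taken from Cochran and Davis).

```python
def predicted_mse_after_doubling(bias, variance):
    """Predict the MSE when m is doubled: the variance halves, the bias is kept."""
    mse = bias ** 2 + variance          # M = B^2 + V for the original scheme
    return (bias ** 2 + mse) / 2.0      # algebraically equal to B^2 + V/2

# Hypothetical bias and variance for scheme 1.
mse_2 = predicted_mse_after_doubling(bias=0.3, variance=0.4)
```

Because the bias is treated as unchanged although it in fact decreases slightly, the prediction errs on the high side, as the text notes.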

For  further  results  and  comparisons  of  methods  utilizing 
small  sample  theory  see  Cochran  and  Davis  [Ref.  17],  Davis 
[Ref.  22],  Wetherill  [Ref.  130],  and  Odell  [Ref.  8?]. 

D.   ESTIMATION OF EXTREME QUANTILES

For estimates of quantiles near the mid-region of a
quantal response curve the Robbins-Monro method appears to
perform quite well. In fact, for estimation of the θ_.50
quantile both Wetherill [Ref. 130] and Davis [Ref. 22]
showed that sample sizes as small as 35 produced results
which were in good general agreement with asymptotic theory.
However, in areas away from the neighborhood of θ_.50 the
small sample estimates frequently have large biases and
have variances greatly in excess of theoretical predictions.
This behavior was also noted by Stillings and Logan [Ref. 108].




To try to explain this phenomenon Wetherill [Ref. 131]
presented the following example.

Suppose an experimenter wishes to estimate θ_.90 and
that his initial level, x_1, is very close to the true value.
Suppose further that the first observation is zero, a failure
(as it will be about once every ten trials); then the
second observation will be taken at the level

$$x_2 = x_1 - c(0 - 0.90) = x_1 + 0.90c.$$

This value, x_2, may well be far above θ_.90. Assume that
the next two observations are positive (successes). This
leads to

$$x_3 = x_2 - \frac{c}{2}(1 - 0.90) = x_1 + 0.85c$$

and

$$x_4 = x_3 - \frac{c}{3}(1 - 0.90) = x_1 + 0.817c.$$

As can be easily observed, the level of testing returns only
very slowly to the vicinity of θ_.90. In fact, because the
corrections 0.10c/n shrink like 1/n, a minimum of roughly
10^4 further observations is necessary to pass below x_1.
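
The slow return in Wetherill's example can be made concrete by counting the consecutive successes required. The sketch below works in units of c and assumes, as in the example, that every observation after the initial failure is a success; the function name is hypothetical.

```python
def successes_needed(target=0.90):
    """Count consecutive successes needed to return below x_1 after one failure.

    Works in units of c: the offset x_n - x_1 equals `target` after the
    initial failure and drops by (1 - target)/n at step n = 2, 3, ...
    """
    offset, n = target, 1
    while offset >= 0.0:
        n += 1
        offset -= (1.0 - target) / n
    return n - 1  # number of successes consumed

count = successes_needed()
```

The count grows so quickly because the partial sums of 1/n increase only logarithmically, so an excursion of 0.90c is undone at a logarithmic pace.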

Methods using accelerated stochastic approximation tend
to minimize this effect, but the most interesting treatment
of this area thus far has been done by Goodman, Lewis, and
Robbins [Ref. 58]. Here a "maximum transformation" is
employed by taking multiple samples at a level. If it is
desired to estimate θ such that F(θ) = .99, where F(x) is a
cumulative distribution function, then V samples,
S_1, ..., S_V, are taken at each level. Here V is the
solution of the equation

$$(0.99)^V = 0.50.$$

Then let

$$P_{\max}(s) = \mathrm{Prob}\Bigl\{\max_{1 \le i \le V} S_i \le s\Bigr\} = [F(s)]^V.$$

In this case the solution for V is V = 69, and 69 samples
would be taken at each iteration. Thus the problem has
been transformed into estimating the θ_.50 level of P_max,
where the properties of the Robbins-Monro process are known
to work well.

Yuguchi [Ref. 135] followed this same "maximum transformation"
technique and then applied variance reduction
and jackknifing techniques to improve the rate of convergence
and to reduce bias.
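
The number of samples per level used by the maximum transformation follows directly from the defining equation. A minimal sketch, with a hypothetical helper name, rounds the solution of p^V = 0.50 up to the next integer.

```python
import math

def samples_per_level(p, target=0.5):
    """Smallest integer V with p**V <= target, from V = ln(target) / ln(p)."""
    return math.ceil(math.log(target) / math.log(p))

v = samples_per_level(0.99)
transformed_level = 0.99 ** v   # probability level actually targeted with V samples
```

Since V must be an integer, the transformed level is only approximately 0.50; for p = 0.99 the discrepancy is negligible.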

E.   THE  CASE  WHERE  M(x)  STOPS  BEING  A  CONSTANT 

Consider a response function where there is no reaction
for x < θ (i.e., M(x) = 0 for x < θ and M(x) > 0 for x > θ).

Oftentimes one is interested in the level, θ, at which response
first occurs (see Guttman and Guttman [Ref. 60]). Friedman
[Ref. 51] proved the following theorem.

THEOREM   [Ref. 51]

Let the following conditions be satisfied:

(i)     |M(x)| ≤ L|x| + K;

(ii)    σ²(x) ≤ σ² < +∞;

(iii)   if x < θ, then M(x) = 0,
        if x > θ, then M(x) > 0;

(iv)    for every δ > 0,  inf |M(x)| > 0, the infimum being
        taken over δ ≤ x - θ.

Then choose {a_n}, {d_n} such that

$$a_n > 0,\quad \sum_{n=1}^{\infty} a_n = \infty,\quad \sum_{n=1}^{\infty} a_n^2 < \infty,\quad d_n > 0,\quad \sum_{n=1}^{\infty} a_n d_n = \infty,\quad d_n \to 0,$$

and define the relation

$$x_{n+1} = x_n + a_n (d_n - y_n).$$

Then x_n → θ with probability 1 and in mean square.

This theorem says that one can use stochastic approximation
to find the point at which the regression function
stops being a constant if the value of this constant is
known. If one does not know the value of the constant,
Friedman has proved another theorem, which imposes sharper
conditions on M(x), under which x_n does converge to the
desired value θ.
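
The recursion of the theorem can be illustrated with a short simulation. The sketch below assumes the particular sequences a_n = n^{-0.6} and d_n = n^{-0.3} (which satisfy the stated summability conditions), a piecewise-linear response, and normal errors; all names and values are illustrative.

```python
import random

def threshold_search(observe, x1, n_steps, a_exp=0.6, d_exp=0.3):
    """Iterate x_{n+1} = x_n + a_n (d_n - y_n), a_n = n**-a_exp, d_n = n**-d_exp."""
    x = x1
    for n in range(1, n_steps + 1):
        a_n = n ** (-a_exp)   # sum a_n diverges, sum a_n**2 converges
        d_n = n ** (-d_exp)   # d_n -> 0 while sum a_n * d_n diverges
        x = x + a_n * (d_n - observe(x))
    return x

# Assumed response: M(x) = 0 below theta and x - theta above it.
rng = random.Random(0)
theta = 2.0
observe = lambda x: max(0.0, x - theta) + rng.gauss(0.0, 0.1)
x_hat = threshold_search(observe, x1=0.0, n_steps=20000)
```

Below θ the iterate is pushed upward by the positive drift a_n d_n; above θ the response pulls it back, so the iterate settles near the point where M(x) = d_n, which moves toward θ as d_n tends to zero.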

F.   BOTH VARIABLES SUBJECT TO ERROR

In the usual Robbins-Monro procedure it is assumed that
the regression function, M(x), is observable subject to an
error term, say v_x. One might ask under what conditions
the process will converge if there also exists a random error
component, say u_x, in the level setting of x, since in
practice it is not always possible to precisely measure or
set the desired amount. Dupac and Kral [Ref. 35] discussed
two such cases. In the first case the error in setting the
level is assumed to be unaffected by the experimenter. In the
second case it is assumed that the error in the x level can be
made arbitrarily small for an inversely proportional price. In
the first case of "irreducible errors" Dupac and Kral
proved the following theorem.

THEOREM   [Ref. 35]

Assume that the following conditions are satisfied:

(i)     M(·) is odd with respect to θ,
        i.e. M(θ + x) = -M(θ - x) for all x;

(ii)    M(x) is strictly increasing;

(iii)   |M(x_2) - M(x_1)| ≤ C_1 + C_2|x_2 - x_1| for all x_1, x_2;

(iv)    U_x is a symmetric random variable for each
        x, i.e. P(U_x ≤ c) = P(U_x ≥ -c);

(v)     Var U_x ≤ C_3 for all x;

(vi)    Var V_x ≤ C_4 for all x.

Then the Robbins-Monro procedure defined by

$$x_{n+1} = x_n - a_n \{ M(x_n + u_{x_n}) + v_{x_n} \}$$

converges to θ with probability 1 as well as in mean square.

In the second case of Dupac and Kral, where one can
decrease the x setting errors, U_x, for an inversely
proportional price, they proved what intuition would suggest.
They showed that it is needless to pay for high
precision at the starting steps; the precision should be
increased in the course of the approximation process.
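
The "irreducible errors" case can be illustrated by a simulation in which the level actually applied differs from the level intended. The sketch below assumes M(x) = arctan(x - θ), a uniform setting error and a normal observation error, all chosen only because they satisfy conditions (i) through (vi); the names are hypothetical.

```python
import math
import random

def rm_with_setting_error(x1, theta, n_steps, rng):
    """Iterate x_{n+1} = x_n - a_n {M(x_n + u) + v} with a_n = 1/n."""
    x = x1
    for n in range(1, n_steps + 1):
        u = rng.uniform(-0.5, 0.5)        # symmetric error in setting the level
        v = rng.gauss(0.0, 0.2)           # error in observing the response
        y = math.atan(x + u - theta) + v  # M is odd about theta and increasing
        x = x - y / n
    return x

x_hat = rm_with_setting_error(x1=5.0, theta=3.0, n_steps=5000,
                              rng=random.Random(0))
```

The symmetry of the setting error together with the oddness of M about θ is what keeps the expected response at θ equal to zero, so the root of the perturbed regression function is still θ.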

G.   THE CASE OF α UNKNOWN

Consider the following scenario: Suppose a scientist
is comparing two drugs, a test drug and a control drug.
He is interested in designing a biological assay to estimate
the number of dose units of the test drug necessary to
elicit the same mean response as the standard dose of the
control drug. Suppose further that the experimenter knows
little about the shape of the response function associated
with the test drug and about the probability distribution
of response at any one dose level of either drug.

Make the following notational identifications: Let an
observed response to the control drug administered at the
standard dose level correspond to the random variable, Z,
with mean α. Let the observed response to the test drug
at dose level x correspond to Y(x) with mean M(x). Let
θ be the unknown dose level of the test drug such that
M(θ) = α. Then under weak conditions on M(x) and the
distributions of Y(x) and Z, the process defined by

$$x_{n+1} = x_n - a_n \{ Y(x_n) - z_n \}$$

satisfies all known properties of the original Robbins-Monro
procedure. It seems, as was noted by Hamilton [Ref. 62],
that this procedure does not use all available information
at each step. Since the average n^{-1} Σ_{i=1}^{n} z_i is a
better estimator of α than just z_n, one would expect a
smaller mean squared error from the sequential estimate of α,
especially in cases of small sample sizes. To analyze this
Hamilton compared two processes.

Process 1 takes multiple observations at each step
and computes an estimated value of α based only on
the observations taken at that step (possibly
only one).

Process 2 takes the same number of multiple observations
at each step but computes the estimate of
α based on all of the observations from the
beginning of the process.

Hamilton then showed that under certain conditions it
is better, in magnitude of mean squared error, to take
the most recent control observations (process 1) rather
than taking sequential steps toward the mean of the control
observations. This result, based on large sample theory,
remains true in a simplified (linear) small sample situation.
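
The two processes compared by Hamilton differ only in how the control mean is estimated at each step. The sketch below shows both variants with one test and one control observation per step, assuming a_n = 1/n, a linear response M(x) = x and normal errors; the function and argument names are hypothetical and the example does not attempt to reproduce Hamilton's analysis.

```python
import random

def assay(observe_test, observe_control, x1, n_steps, pooled=False):
    """Iterate x_{n+1} = x_n - a_n (Y(x_n) - alpha_hat_n) with a_n = 1/n.

    pooled=False uses only the newest control reading z_n (process 1);
    pooled=True uses the mean of all control readings so far (process 2).
    """
    x, z_sum = x1, 0.0
    for n in range(1, n_steps + 1):
        z_n = observe_control()
        z_sum += z_n
        alpha_hat = z_sum / n if pooled else z_n
        x = x - (1.0 / n) * (observe_test(x) - alpha_hat)
    return x

# Assumed setting: M(x) = x and alpha = 3, so that theta = 3.
rng = random.Random(0)
alpha = 3.0
observe_test = lambda x: x + rng.gauss(0.0, 0.5)
observe_control = lambda: alpha + rng.gauss(0.0, 0.5)
x_latest = assay(observe_test, observe_control, x1=0.0, n_steps=5000)
x_pooled = assay(observe_test, observe_control, x1=0.0, n_steps=5000, pooled=True)
```

Both variants converge to θ in this setting; comparing their mean squared errors would require repeating the runs many times with independent seeds.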
VI.   APPLICATIONS

In this Section several applications of stochastic
approximation to a variety of fields will be presented.
The first example is an application to a problem in biological
research by Guttman and Guttman [Ref. 60]. It is
presented first since it is a simple, straightforward
problem of the type for which the Robbins-Monro method was
conceived (also see Hawkins [Ref. 63]). This straightforward
use of the R-M method is also applicable to industrial
process control as discussed by Comer [Refs. 18 and
19], where a lag in process response is incorporated into
the formulation.

However, more practical use of stochastic approximation
is based on the concepts of maximization or minimization of
functions. Many problems which can be analytically solved
if the response format is known fall nicely into the stochastic
approximation framework since answers do not depend
on the assumed parameterization. Also many problems based
on a criterion, such as minimizing expected squared error,
can be computationally very difficult to solve, as the
solutions may require matrix inversions, as in the
multidimensional case. Many problems of this type (see Saridis,
Nikolic and Fu [Ref. 99]) fall into the stochastic approximation
framework and yield computationally simple algorithms
which require very little storage space when performed on
a digital computer.

In a recent book edited by Mendel and Fu [Ref. 83] a
chapter has been devoted to applications of stochastic
approximation methods. Also Tsypkin [Ref. 112] has nicely
reviewed the important applicability of the Robbins-Monro
process and related stochastic approximation methods to
problems concerning pattern recognition, adaptive filters,
adaptive automatic control systems, and adaptation in operations
research and reliability theory. Some of the additional
papers which have considered these latter types of application
of stochastic approximation are by Aizerman et al. [Ref. 1],
Ernst [Ref. 38], Kailath and Schalkwijk [Ref. 67], Lee
[Ref. 80], Sakrison [Refs. 93, 94], Sklansky [Ref. 106],
Tsypkin [Ref. 111] and Ulrich [Ref. 116].

A.   APPLICATION  TO  A  PROBLEM  IN  BIOLOGICAL  RESEARCH 

Guttman and Guttman [Ref. 60] desired to treat Paramecium
caudatum cells with a substance, kinetin, which would
stimulate cell division, and to estimate the time at which
a certain level of this cell division was attained. They
postulated that the ratio of the number of daily cell
divisions of treated paramecia to untreated paramecia (K/C)
was a monotone increasing function of time of exposure to
kinetin. Guttman and Guttman stated that they had no idea
of the underlying probability distribution concerning the
ratio, K/C, thereby making stochastic approximation a very
convenient scheme. A Robbins-Monro scheme was formulated
to estimate the time at which K/C = 1.10. The initial
guess of X_1 = 30 hours was chosen with the expectation
that the desired value of X was somewhere below this.

The sequence {a_n} was chosen as 20/n to allow for
large corrections in the first few steps and smaller
corrections thereafter. The stochastic approximation
sequence, as formulated for this problem, then looked like

$$X_{n+1} = X_n + \frac{20}{n}\,(1.10 - Y_n),$$

where Y_n = the observed response ratio at time X_n. Guttman
and Guttman's table of observations, Y_n, and computed next
levels, X_n, is reproduced in Table 2.

The experiment was terminated at n = 13 as no appreciable
differences appeared among the X_n from trial 6 onward.
Note that the mean value of the observations from n = 6
onwards is in fact equal to 1.10.
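
The update used in the kinetin experiment is simple enough to be carried out by hand, or with a few lines of code. The helper below is a hypothetical illustration of the recursion with target ratio 1.10 and a_n = 20/n; the example values are those of the second trial in Table 2.

```python
def next_level(x_n, y_n, n, target=1.10, c=20.0):
    """X_{n+1} = X_n + (c/n) * (target - Y_n) for the kinetin experiment."""
    return x_n + (c / n) * (target - y_n)

# Second trial of Table 2: 30.7 hours with an observed ratio of 1.30.
x3 = next_level(30.7, 1.30, n=2)
```

An observed ratio above the target shortens the next exposure time, and the size of the correction shrinks as n grows.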


TABLE 2

Stochastic Approximation of Hours of
Treatment Required with 1.5 mg/l
Kinetin to Produce an Expected Ratio
of Divisions Kinetin/Control Equal to 1.1

TRIAL (n)   HOURS OF TREATMENT (x_n)   OBSERVED K/C (Y_n)   WEIGHT (a_n)
    1               30                      1.065               20
    2               30.7                    1.30                10
    3               28.7                    1.131               6.67
    4               27.3                    1.223               5
    5               26.6                    1.577               4
    6               24.8                    1.133               3.33
    7               24.6                    0.89                2.86
    8               25.2                    1.00                2.5
    9               25.5                    0.81                2.2
   10               25.6                    1.31                2
   11               25.1                    1.21                1.82
   12               24.8                    1.03                1.66
   13               24.9

B.   AN APPLICATION TO TAILORED TESTING

Suppose an educator or psychologist desires to measure
some mental trait of an individual. For instance, suppose
it is desired to measure the level of difficulty of questions,
X, such that the individual will get, say, α = 70% of them
correct. Suppose further that the educator has a bag full
of questions, each assigned a level of difficulty, B_i, such
that the probability that an individual, whose true ability
is at level i, will correctly answer a question of difficulty
B_i is equal to α = .70. This is written

$$P_i(B_i) = \alpha.$$

This idea was presented by Lord [Ref. 82], who proposed
a computer-controlled testing scheme where questions of
difficulty B_i would be recursively selected by the scheme

$$B_{i+1} = B_i + a_i \{ Y(B_i) - \alpha \}.$$

Thus the scheme would eventually converge to the
individual's true ability, provided that the assumptions
were correct.
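
The tailored testing recursion can be sketched with a simulated examinee. The logistic response curve, the multiplier a_i = 5/i and all names below are assumptions made for illustration; they are not taken from Lord's paper.

```python
import math
import random

def tailored_test(answer, b1, alpha, n_items, c=5.0):
    """Iterate B_{i+1} = B_i + a_i (Y(B_i) - alpha) with a_i = c/i."""
    b = b1
    for i in range(1, n_items + 1):
        y = 1.0 if answer(b) else 0.0   # 1 = correct, 0 = incorrect
        b = b + (c / i) * (y - alpha)
    return b

# Hypothetical examinee: logistic response curve centred on `ability`.
rng = random.Random(0)
ability = 1.0
answer = lambda b: rng.random() < 1.0 / (1.0 + math.exp(b - ability))
b_hat = tailored_test(answer, b1=0.0, alpha=0.70, n_items=5000)
b_star = ability - math.log(0.70 / 0.30)  # difficulty answered correctly 70% of the time
```

A correct answer raises the difficulty of the next question and an incorrect answer lowers it, so the sequence settles at the difficulty the examinee answers correctly with probability α.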

C.  UPGRADING OF INERTIAL NAVIGATION SYSTEMS

Consider a navigational platform with several high-grade
gyros required for motion sensing. Bernard Lee [Ref. 80]
suggested replacing all but one gyro with a lower-grade, less
expensive gyro. A supervisory system based on a continuous
Kiefer-Wolfowitz stochastic approximation algorithm similar
to that developed by Sakrison [Ref. 90] is then used to
estimate the drift rate of each of the low-grade gyros and
to apply a corrective signal. This concept permits each
substandard gyro to acquire a precision approaching that
of the higher-grade gyro.

D.  APPROXIMATION OF DISTRIBUTION AND DENSITY FUNCTIONS

Consider the distribution F(a) = Prob[X ≤ a], where
X is a scalar random variable. The problem is to find an
approximation to F(·) by a linear combination of a previously
chosen vector of functions, φ^T(x) = (φ_1(x), φ_2(x), ..., φ_n(x)),
where the superscript T denotes the transpose of the column
vector φ(x). Thus we desire to find a column vector of
coefficients, C_F, such that our approximation

$$\hat{F}(x) = C_F^T \varphi(x)$$

minimizes some criterion such as the expected squared
error in a region of interest (a, b). Denote the mean square
error as

$$J_F(C_F) = \int_a^b \{F(x) - C_F^T \varphi(x)\}^2\,dx.$$

Now minimizing J_F(C_F) is equivalent to solving the matrix
equation

$$-\frac{1}{2}\,\frac{dJ_F(C_F)}{dC_F^T} = \int_a^b F(x)\,\varphi^T(x)\,dx - C_F^T \int_a^b \varphi(x)\,\varphi^T(x)\,dx = 0$$

or

$$\int_a^b F(x)\,\varphi(x)\,dx - K C_F = 0,$$

where

$$K = \int_a^b \varphi(x)\,\varphi^T(x)\,dx$$

is an n × n matrix.

Now define a random function Z_F(y, x) such that

$$Z_F(y, x) = \begin{cases} 1 & \text{if } y \le x, \\ 0 & \text{if } y > x, \end{cases}$$

and such that

$$E[Z_F(Y, x)] = 1 \cdot F(x) + 0 \cdot (1 - F(x)) = F(x).$$


Thus the regression matrix equation

$$E\Bigl\{\int_a^b Z_F(Y, x)\,\varphi(x)\,dx\Bigr\} - K C_F = 0$$

is equivalent to our previous equations for finding the
minimum of the criterion, J_F(C_F). But this can now be solved
by a stochastic approximation algorithm if successive
independent samples of the random variable, Y, are available.
The algorithm can be written as

$$C_F(j+1) = C_F(j) + a_j\,[B_F(y(j)) - K C_F(j)],$$

where we define

$$B_F(y(j)) = \int_a^b Z_F(y(j), x)\,\varphi(x)\,dx,$$

i.e.

$$B_F(y(j)) = \begin{cases} \int_a^b \varphi(x)\,dx & \text{if } y(j) < a, \\ \int_{y(j)}^b \varphi(x)\,dx & \text{if } a \le y(j) \le b, \\ 0 & \text{if } b < y(j), \end{cases}$$

where y(j) is the j-th sample from the distribution and
the sequence a_j satisfies

$$\sum_{j=1}^{\infty} a_j = \infty \quad \text{and} \quad \sum_{j=1}^{\infty} a_j^2 < +\infty.$$

Thus the above algorithm now fits the format of
multidimensional stochastic approximation. In particular, if
the matrix K is positive definite, it satisfies the conditions
of a theorem by Blum [Ref. 7, theorem 2].

Then the sequence C_F(j) converges with probability 1 to
the value which minimizes J_F(C_F). This value can be written
as

$$C_F^{*} = K^{-1} \int_a^b F(x)\,\varphi(x)\,dx,$$

but requires inversion of an n × n matrix to solve directly.
Therefore the above algorithm enables one to find a minimum
mean square error approximation to a distribution function
for which the only available information is a collection
of randomly selected sample values. This algorithm is from
a paper by Blaydon [Ref. 4] and can be similarly extended to
approximate density functions.

A refinement of Blaydon's algorithm was presented
by Deuser and Lainiotis [Ref. 27]. The refinement incorporates
a double stochastic approximation algorithm to recursively
generate a matrix from each independent observation and
then to recursively generate the estimate of the coefficient
vector using the previously generated matrix as an observation.
Deuser and Lainiotis presented the example where the
unknown probability distribution is F(x) = 1 - e^{-x} for x ≥ 0.

The approximating function is to be a weighted
sum of the first three Laguerre polynomials,
and the initial choice of the coefficient vector is the
zero vector. It can be shown analytically that the optimal
coefficients are

$$C^T = (.480,\ -.186,\ -.239).$$

In a computer simulation using 1000 samples and using
the step sequence a_n = 1/n, Deuser and Lainiotis obtained
estimates for the coefficients which, on the average, did
not differ by more than .01 in absolute value from the
optimal coefficients.
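
The recursion C_F(j+1) = C_F(j) + a_j [B_F(y(j)) - K C_F(j)] can be illustrated on a deliberately small problem. The sketch below is not the Deuser and Lainiotis example: it assumes a uniform distribution on (0, 1), the region (a, b) = (0, 1), and a two-function basis φ(x) = (1, √3(2x - 1)) that is orthonormal on that region, so that K is the identity matrix. All names are hypothetical.

```python
import math
import random

SQRT3 = math.sqrt(3.0)

def b_vector(y):
    """B_F(y): integral of phi over (max(y, 0), 1) for phi = (1, sqrt(3)(2x - 1))."""
    if y > 1.0:
        return [0.0, 0.0]
    lo = max(y, 0.0)
    return [1.0 - lo, SQRT3 * (lo - lo * lo)]

def approximate_cdf(sample, k_matrix, n_samples):
    """Iterate C(j+1) = C(j) + a_j [B_F(y(j)) - K C(j)] with a_j = 1/j, C(1) = 0."""
    c = [0.0, 0.0]
    for j in range(1, n_samples + 1):
        b = b_vector(sample())
        kc = [sum(k_matrix[r][s] * c[s] for s in range(2)) for r in range(2)]
        c = [c[r] + (b[r] - kc[r]) / j for r in range(2)]
    return c

rng = random.Random(0)
K = [[1.0, 0.0], [0.0, 1.0]]          # the assumed basis is orthonormal on (0, 1)
c_hat = approximate_cdf(rng.random, K, 20000)
c_star = [0.5, SQRT3 / 6.0]           # K^{-1} * integral of F(x) phi(x) for F(x) = x
```

With these coefficients the approximation C^T φ(x) reduces to x, the exact uniform distribution function. For a basis that is not orthonormal K is no longer the identity, and the smallest eigenvalue of K then governs how quickly the iteration with a_j = 1/j settles.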

E.   AN  APPROACH  TO  PATTERN  RECOGNITION  USING  STOCHASTIC 
APPROXIMATION  TO  MINIMIZE  RISK 

Consider a mixture from two samples where an observation
which is drawn at random is of type 1 with unknown probability
p, and is of type 2 with probability 1 - p.
It is desired to measure some quality of the samples, call
it X, and apply a decision rule, say

$$d(x, a) = \begin{cases} 1 & \text{if } x \le a, \\ 2 & \text{if } x > a, \end{cases}$$

where a* = θ is some unknown value which minimizes a risk
function, R(d(x, a)), which we have chosen. Since the choice
of a completely specifies the decision rule and risk function,
denote them d(a) and R(a).

Now R(a) can be viewed as a regression function. By
this it is meant that there exists a random variable, Y,
with conditional probability distribution function F(Y|a)
such that

$$R(a) = E(Y \mid a).$$

Such a random variable, Y, is defined as follows:

Let Y (given a) = C_ij if Z is an observation which
actually is of type i and is classified
by d(z, a) = type j.

In general C_ii = 0.

Then a simple one-dimensional stochastic approximation
scheme can demonstrate the solution process. Consider a
test sample where it is not known of which population each
item is a member. Then define the scheme


$$a_{n+1} = a_n - \frac{b_n}{2d_n}\,(Y_n^{+} - Y_n^{-}),$$

where Y_n^+ = C_ij if sample Z_{2n-1} is actually of type i and
d(Z_{2n-1}, a_n + d_n) = type j, and Y_n^- = C_ij if sample
Z_{2n} is actually of type i and d(Z_{2n}, a_n - d_n) = type j,
where a_1 is chosen arbitrarily and the conditions

$$\sum_{n=1}^{\infty} b_n = \infty, \qquad \lim_{n \to \infty} d_n = 0, \qquad \sum_{n=1}^{\infty} (b_n / d_n)^2 < +\infty$$

are satisfied. Note that the risk function must satisfy

$$\sup_{1/k < a - \theta < k} \overline{D}\,R(a) > 0$$

and

$$\inf_{1/k < a - \theta < k} \underline{D}\,R(a) < 0 \quad \text{for all integers } k,$$

where $\overline{D}R(a)$ = the limit superior of

$$\frac{R(a + h) - R(a)}{h} \quad \text{for } h \to 0,$$

and $\underline{D}R(a)$ = the limit inferior of

$$\frac{R(a + h) - R(a)}{h} \quad \text{for } h \to 0.$$

Note that R(a) does not have to be differentiable at all a.

Then if the above conditions are satisfied, a_n converges
in probability to θ and lim_{n→∞} E[(a_n - θ)²] = 0.
Then the decision rule which minimizes the risk function
is

$$d(x, \theta) = \begin{cases} 1 & \text{if } x \le \theta, \\ 2 & \text{if } x > \theta. \end{cases}$$


The  above  one-dimensional  scheme  was  presented  by  Cooper 
[Ref.  20]  who  stated  that  the  application  to  a  K-dimensional 
scheme  including  noise  could  be  performed  by  modifying  the 
above  procedure  to  the  multidimensional  case  of  Blum  [Ref.  7] 

It  is  noted  that  the  above  sample  falls  into  the  frame- 
work of  Bayesian  learning  and  decision  rules.  An  excellent 
paper  by  Chien  and  Fu  [Ref.  13]  discusses  Bayesian  related 
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learning  procedures  which  can  be  shown  to  be  a  special  case 
of  stochastic  approximation  algorithms  and  hence  can  be 
carried  out  in  computationally  simple  schemes  as  the  one 
just  presented. 
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VII.   AREAS  FOR  FURTHER  STUDY 

This section is devoted to stating particular areas where further work may be of interest. These ideas have been noted as either not being discussed in the current literature or as having been analyzed when required conditions were not satisfied.

A.  DEVIATIONS  FROM  THE  LINEAR  CASE 

Section III.G discussed the estimate of expected squared error for the linear case and mentioned that, other than a sampling experiment by Teichroew, very little analysis had been done. Work needs to be done in this area to determine the limits of departure from linearity within which the linear results remain valid.

B.  STOPPING  RULES 

Stopping rules not based on bounded confidence intervals utilizing asymptotic normality are almost nonexistent. What is needed is some nonparametric stopping rule based on, say, the number of changes of sign of $(x_n - x_{n-1})$. Many authors have noted that this is a virtually untouched area, yet almost nothing has appeared in the literature.
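
To make the suggestion concrete, the following is a minimal sketch of what a sign-change stopping rule could look like. It is hypothetical and is not taken from the literature: the function name, the choice $a_n = 1/n$, the threshold `max_changes`, and the linear test response are all assumptions made for illustration, and no claim is made about the statistical properties of the rule.

```python
import random

def robbins_monro_sign_stop(noisy_response, x1, alpha, max_changes, max_steps=100000):
    """Robbins-Monro search for M(x) = alpha, stopped after a fixed number of
    sign changes in successive increments x_n - x_{n-1} (hypothetical rule)."""
    x = x1
    last_sign = 0
    changes = 0
    for n in range(1, max_steps + 1):
        x_next = x - (1.0 / n) * (noisy_response(x) - alpha)
        step = x_next - x
        sign = (step > 0) - (step < 0)
        # Count a change only when two nonzero increments have opposite signs.
        if sign != 0 and last_sign != 0 and sign != last_sign:
            changes += 1
        if sign != 0:
            last_sign = sign
        x = x_next
        if changes >= max_changes:
            return x, n
    return x, max_steps

rng = random.Random(0)

def response(x):
    """Assumed test response: M(x) = 2(x - 3) plus unit-variance Gaussian noise."""
    return 2.0 * (x - 3.0) + rng.gauss(0.0, 1.0)

estimate, steps_used = robbins_monro_sign_stop(response, x1=0.0, alpha=0.0, max_changes=200)
print(estimate, steps_used)
```

The intuition behind such a rule is that far from the root the increments tend to keep one sign, while near the root the noise dominates and the sign flips frequently, so the count of sign changes acts as a rough indicator of proximity.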

C.  POSSIBLE WEAKENING OF CONDITIONS ON $a_n$

In Comer's paper "Application of Stochastic Approximation to Process Control" [Ref. 19], an error in the formulation of a Robbins-Monro process yields interesting results. Comer mistakenly used a step sequence of the form $a_n = 1/(n)^{p}$ in a simulation comparison, with a power $p$ for which the sequence does not satisfy the requirement $\sum_{n=1}^{\infty} (a_n)^2 < \infty$. However, his results, when compared with the same simulation using $a_n = 1/n$, which does satisfy the necessary requirements, show that Comer's sequence gives comparable if not superior results. The question to explore is whether (1) this is a consequence of Comer's simulation error, or (2) the conditions on $a_n$ can actually be weakened in practice to obtain more desirable results.
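
A small experiment of the kind suggested here might be set up as follows. This sketch does not reproduce Comer's simulation: the linear response with a shallow slope, the starting point, the noise level, and the use of $1/\sqrt{n}$ as an example of a sequence violating the square-summability requirement are all assumptions chosen for illustration.

```python
import math
import random

def robbins_monro(step, n_steps, x1, slope, theta, rng):
    """Robbins-Monro recursion for the root of M(x) = slope * (x - theta)."""
    x = x1
    for n in range(1, n_steps + 1):
        y = slope * (x - theta) + rng.gauss(0.0, 1.0)   # noisy observation at x
        x -= step(n) * y
    return x

def mean_squared_error(step, n_steps=1000, reps=50, seed=0):
    """Average squared error of the final estimate over independent replications."""
    rng = random.Random(seed)
    theta = 0.0
    total = 0.0
    for _ in range(reps):
        x = robbins_monro(step, n_steps, x1=10.0, slope=0.2, theta=theta, rng=rng)
        total += (x - theta) ** 2
    return total / reps

mse_inv_n = mean_squared_error(lambda n: 1.0 / n)                 # satisfies both conditions
mse_inv_sqrt = mean_squared_error(lambda n: 1.0 / math.sqrt(n))   # sum of squares diverges
print(mse_inv_n, mse_inv_sqrt)
```

In this assumed setting the response is shallow and the start is far from the root, so the $1/n$ steps shrink too quickly to remove the initial error within the run, while the slower-decaying sequence does not have that problem. With a steeper response or a closer start the comparison can go the other way, which is why a systematic study would be needed.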


D.  REPEAT  SIMULATION  OF  THE  KIEFER-WOLFOWITZ  PROCESS 

In a previous simulation comparison study of Kiefer-Wolfowitz type methods, Springer [Ref. 107] used as a sequence of norming constants the sequence where $a_{n+1} = a_n/p$. He discussed the result of finding a small-sample bias which, one should note, can be attributed to the fact that this sequence does not satisfy the assumption that $\sum_{n=1}^{\infty} a_n = \infty$. Perhaps a new simulation study using proper coefficients is in order.
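
The failure of the divergence condition can be checked directly. The short derivation below assumes the recursion is read as a geometric sequence with a constant ratio $p > 1$; the value of $p$ is not specified here.

$$a_n = a_1\, p^{-(n-1)}, \qquad \sum_{n=1}^{\infty} a_n = a_1 \sum_{m=0}^{\infty} p^{-m} = \frac{a_1\, p}{p - 1} < \infty .$$

Under that reading the sum of the norming constants is finite, so the requirement $\sum a_n = \infty$ is violated for every $p > 1$.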

E.  MULTIDIMENSIONAL EXTENSION OF DUPAC AND KRAL'S RESULTS

Dupac and Kral [Ref. 35] (see Sec. V.F) examined the Robbins-Monro one-dimensional case where there are errors in setting the X-level. They cited conditions under which $X_n \xrightarrow{P} \theta$ when these errors exist. They noted that errors of this type make the Kiefer-Wolfowitz procedure practically inapplicable to this type of analysis, but speculated that a generalization to the multidimensional case might be of interest.
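
The one-dimensional situation with setting errors can be illustrated with a minimal sketch. The conditions cited by Dupac and Kral are not reproduced here; the zero-mean Gaussian setting error, the linear response with unit slope, the step sequence $1/n$, and all names in the code are assumptions made only to show the mechanics.

```python
import random

def robbins_monro_with_setting_error(n_steps, x1, theta, error_sd, seed=0):
    """Robbins-Monro recursion in which the intended level x_n is applied with error."""
    rng = random.Random(seed)
    x = x1
    for n in range(1, n_steps + 1):
        x_actual = x + rng.gauss(0.0, error_sd)        # level actually applied (assumed error model)
        y = (x_actual - theta) + rng.gauss(0.0, 1.0)   # noisy response at the applied level
        x -= (1.0 / n) * y                             # update uses the intended level x_n
    return x

print(robbins_monro_with_setting_error(5000, x1=0.0, theta=2.0, error_sd=0.5))
```

For a linear response, a zero-mean setting error simply adds to the observation noise, so the recursion still settles near the root in this assumed case. A nonlinear response or a biased setting error would need the more careful treatment referred to above.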